Description

TECHNICAL FIELD 

[0001] The invention pertains to speech recognition, and in particular to phone boundary detection in the acoustic input.
TERMS 

[0002] Symbol: Characterizing acoustic speech based on n features, acoustic speech is viewed in an n-dimensional acoustic space. The space is partitioned into regions, each of which is identified by an n-dimensional prototype vector. Each prototype vector is represented by a 'symbol', such as a number or other identifier. Uttered speech may be viewed as successive 'symbols'.

[0003] Feneme (also Label): A symbol corresponding to a prototype vector, the symbol being defined based on features of sound occurring during a fixed interval of time. Sound may be characterized as having, for example, twenty features, the magnitude of each feature during a centisecond interval corresponding to a prototype vector component. Each prototype vector thus has a corresponding set of feature values for a centisecond interval. Based on the feature values generated during a centisecond interval, one prototype vector from a fixed set of prototype vectors is selected as the closest. With each prototype vector having a corresponding feneme (or label), the set of prototype vectors corresponds to an alphabet of fenemes (or labels). Sample fenemes are listed in Table 1, the first feneme 001 being defined as AA11. An acoustic processor examines uttered speech one interval after another and, based on which prototype vector is closest by some measure to the feature values, the feneme for the closest prototype vector is assigned to the interval. The feneme is distinguished from the well-known phoneme in that the former is based on feature values examined over a fixed interval of time (e.g., a centisecond) whereas the latter is based on a predefined set of basic phonetic sound units without regard to time limitations.
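
The labeling step described above can be illustrated with a minimal sketch. It assumes Euclidean distance as the closeness measure (the text only says "some measure") and uses toy two-feature prototypes; the function name `label_frames` and the prototype values are hypothetical and not taken from the specification.

```python
import math

def label_frames(frames, prototypes):
    """Assign to each frame the label (feneme) of its closest prototype vector."""
    labels = []
    for frame in frames:
        best_label, best_dist = None, math.inf
        for label, proto in prototypes.items():
            # Euclidean distance is an assumed closeness measure.
            dist = math.dist(frame, proto)
            if dist < best_dist:
                best_label, best_dist = label, dist
        labels.append(best_label)
    return labels

# Toy prototypes with two features each (real systems use e.g. twenty).
prototypes = {"AA11": (0.0, 0.0), "SX1": (1.0, 1.0)}
frames = [(0.1, -0.1), (0.9, 1.2), (0.6, 0.6)]  # one vector per centisecond
print(label_frames(frames, prototypes))
```

Each centisecond frame thus yields exactly one label, so an utterance becomes a label string whose length equals the number of frames.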

[0004] Markov Model (also probabilistic finite state machine): A sound event can be represented as a collection of states connected to one another by transitions which produce symbols from a finite alphabet. Each transition from a state to a state has associated with it a probability which is the probability that a transition t will be chosen next when a state s is reached. Also, for each possible label output at a transition, there is a corresponding probability. The model starts in one or more initial states and ends in one or more final states.

[0005] Phone: A unit of sound for which a Markov model is assigned. A first type of phone is phonetically based, each phoneme corresponding to a respective phone. A standard set of phonemes is defined in the International Phonetic Alphabet. A second type of phone is feneme-based, each feneme corresponding to a respective phone.
[0006] Polling: From a training text, it is determined how often each label occurs in each vocabulary word. From such data, tables are generated in which each label has a vote for each vocabulary word and, optionally, each label has a penalty for each word. When an acoustic processor generates a string of labels, the votes (and penalties) for each vocabulary word are computed to provide a match value. The process of tallying the votes is "polling".
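
As a rough illustration of polling, the sketch below tallies votes and penalties per word. How votes and penalties are combined is not stated in the text; summing votes and subtracting penalties is an assumption, and the table contents and the name `poll` are hypothetical.

```python
def poll(label_string, votes, penalties=None):
    """Tally a match value per vocabulary word from per-label votes and penalties."""
    penalties = penalties or {}
    scores = {}
    for word, word_votes in votes.items():
        word_penalties = penalties.get(word, {})
        # Assumed combination rule: add each label's vote, subtract its penalty.
        scores[word] = sum(
            word_votes.get(label, 0.0) - word_penalties.get(label, 0.0)
            for label in label_string
        )
    return scores

# Hypothetical vote and penalty tables derived from training text.
votes = {"cat": {"AA11": 2.0, "SX1": 0.5}, "dog": {"SX1": 1.5}}
penalties = {"dog": {"AA11": 1.0}}
print(poll(["AA11", "AA11", "SX1"], votes, penalties))
```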
[0007] In some known approaches to speech recognition, words are represented by phone-based Markov models, and input speech, after conversion to a coded sequence of acoustic elements or labels, is decoded by matching the label sequences to these models, using probabilistic algorithms such as Viterbi decoding.


BACKGROUND 

a: Overview of Speech Recognition 

[0008] (1) Labeling of Speech Input Signal

A preliminary function of this speech recognition system is the conversion of the speech input signal into a coded representation. This is done in a procedure that was described, for example, in "Continuous Speech Recognition with Automatically Selected Acoustic Prototypes Obtained by either Bootstrapping or Clustering" by A. Nadas et al., Proceedings ICASSP 1981, pp. 1153-1155.

[0009] In accordance with the Nadas et al. conversion procedure, speech input is divided into centisecond intervals. For each centisecond interval, a spectral analysis of the speech input is made. A determination is then made as to which of a plurality of predefined spectral patterns the centisecond of speech input most closely corresponds. A "feneme" that indicates which spectral pattern most closely conforms to the speech input is then assigned to the particular centisecond interval. Each feneme, in turn, is represented as a distinct label.

[0010] A string of labels (or fenemes) thereby represents successive centiseconds of speech which, in turn, form words.

[0011] A typical finite set of labels is shown in Table 1 which is appended to this specification. It comprises about 200 labels, each of which represents an acoustic element. It should be noted that these acoustic elements are shorter than the usual "phonemes" which roughly represent vowels or consonants of the alphabet, i.e., each phoneme would correspond to a sequence of labeled acoustic elements.

[0012] An important feature of this labeling technique is that it can be done automatically on the basis of the acoustic signal and thus needs no phonetic interpretation. The unit which does the conversion from the acoustic input signal to a coded representation in the form of a label string is called an "acoustic processor".
[0013] (2) Statistical Model Representation of Words

The basic functions of a speech recognition system in which the present invention can be used will be described here briefly, though several publications are also available which give more details of such a system, in particular F. Jelinek, "Continuous Speech Recognition by Statistical Methods", Proceedings IEEE, Vol. 64, 1976, pp. 532-576.
[0014] In the system, each word of the recognition vocabulary is represented by a baseform wherein the word is divided for recognition purposes into a structure of phones, i.e. phonetic elements as shown in FIG. 1. These phones correspond generally to the sounds of vowels and consonants as are commonly used in phonetic alphabets. In actual speech, a portion of a word may have different pronunciations as is indicated by the parallel branches in FIG. 1. The parallel branches which extend between nodes through which all such branches pass may alternatively be considered together as a "clink" or as separate conventional phones. The clink, as the principles of this invention apply, may be viewed as a substitute phonetic element for the phones discussed hereinbelow. The phones, in turn, are represented by Markov models. Referring now to FIG. 2, a sample Markov model for a phone is illustrated. For each phone there is a corresponding Markov model characterized by (a) a plurality of states (S0 ... S4), (b) transitions (T1 ... T10) between the states, and (c) label probabilities, each representing the likelihood that the phone will produce a particular label at a given transition. In one embodiment each transition in the Markov model has two hundred stored label probabilities associated therewith, each probability representing the likelihood that each respective label (of a set of 200 labels) is produced by the phone at a given transition. Different phones are distinguished in their respective Markov models by differences in the label probabilities associated with the various transitions. The number of states and transitions therebetween may differ but, preferably, these factors remain the same and the stored label probabilities vary.
[0015] In the Markov model of FIG. 2, a string of labels SX1-SX3-SX5-SH2 (taken from Table 2) has entered the phone model in the order shown. The probability of each label occurring at the transition at which it is shown (e.g. SX1 at transition T1) is determined based on the corresponding stored label probability. Phone models having the highest label probabilities for the labels in the string are the most likely phones to have produced the string.
[0016] While the labels in FIG. 2 suggest continuity from label to label along transition to transition, which enables a simple one-to-one alignment between string label and transition, the Markov model of FIG. 2 also permits other alignments as well. That is, the Markov model of FIG. 2 can determine that a phone is likely even where more labels, fewer labels, or even different labels are applied to the phone model. In this regard, besides transitions from one state to another, there are also transitions (T5, T6, T7) that go back to the same state that was just left. Furthermore, there are transitions (T8, T9, T10) that skip a neighbor state. The Markov model thereby provides that different pronunciations of a phone can be accommodated in the same basic Markov model. If, for example, a sound is stretched (slow speaker) so that the same acoustic element appears several times instead of only once as usual, the Markov model representation allows several transitions back to the same state, thus accommodating the several appearances of the acoustic element. If, however, an acoustic element that usually belongs to a phone does not appear at all in a particular pronunciation, the respective transition of the model can be skipped.

[0017] Any possible path (Markov chain) from the initial state to the final state of the Markov model (including multiple occurrences of the turnback transitions, T5, T6 or T7) represents one utterance of the word (or phone), one acoustic element or label being associated with each transition.

[0018] In the present invention, label strings are "aligned" to Markov models by associating labels in the string with transitions in a path through the model, and determining probabilities of each label being at the associated transition on the basis of stored label probabilities set by previous experiences or training (as explained below). A chain of Markov models having the highest probability identifies the word that is to be selected as output.

[0019] The baseforms of the words and the basic Markov models of phones can be derived and defined in different ways, as described in the cited literature. Model generation may be done by a linguist, or the models can be derived automatically using statistical methods. As the preparation of the models is not part of the invention, it will not be described in more detail.

[0020] It should be mentioned that instead of representing words first by a sequence of Markov phone models, they could also be directly represented by Markov word models, as by a sequence of states and transitions that represent the basic string of acoustic elements for the whole word.

[0021] After structuring of the basic models that represent the words in a vocabulary, the models must be trained in order to furnish them with the statistics (e.g. label probabilities) for actual pronunciations or utterances of all the words in the vocabulary. For this purpose, each word is spoken several times, and the label string that is obtained for each utterance is "aligned" to the respective word model, i.e. it is determined how the respective label string can be obtained by stepping through the model, and count values are accumulated for the respective transitions. A statistical Markov model is formulated for each phone and thus for each word as a combination of phones. From the Markov model it can be determined with what probability each of various different label strings was caused by utterance of a given word of the vocabulary. A storage table representing such a statistical Markov model is shown in FIG. 3 and will be explained in more detail in a later section.

[0022] For actual speech recognition, the speech signal is converted by the acoustic processor to a label string which is then "matched" against the existing word models. A specific procedure, the Viterbi Algorithm (described briefly in the above mentioned Jelinek paper and in more detail in a paper by G. D. Forney, "The Viterbi Algorithm", Proceedings IEEE, Vol. 61, 1973, pp. 268-278) is used for this, and the result is a probability vector for each of a number of "close" words which may have caused the given label sequence. Then the actual output, i.e. the identification of a word that is selected as the recognition output, is determined by selecting the word whose generated probability is found to be the highest.

[0023] The estimation of phone probabilities is an essential part of "the match". Typically, the recognition is carried out in a maximum likelihood framework, where all words in the vocabulary are represented as a sequence of phones, and the probability of a given acoustic feature vector, conditioned on the phone, is computed (i.e. P(acoustic/phone)). The recognition process hypothesizes that a given word in the vocabulary is the correct word and computes a probabilistic score for this word as described above; this is done for all words in the vocabulary. Subsequently, the acoustic score is combined with a score provided by a language model, and the word with the highest combined score is chosen to be the correct one.

[0024] The probability P(acoustic/phone) is equal to the probability that the current state of the Markov model for the phone produces the observed acoustic vector at the current time, and this probability is accumulated over several time frames till the cumulative product falls below a defined threshold, at which point it is hypothesized that the phone has ended and the next phone has started. In this technique, it is possible that frames that do not actually belong to the current phone are also taken into account while computing the score for the phone. This problem can be avoided if the beginning and end times of a phone are known with a greater level of certainty. A technique to estimate the boundary points is given in ["Transform Representation of the Spectra of Acoustic Speech Segments with Applications - I: General Approach and Speech Recognition", IEEE Transactions on Speech and Audio Processing, pp. 180-195, vol. 1, no. 2, April 1993], which is based on using the relative variation between successive frames. However, it is quite expensive computationally, and is also constrained in terms of the extent of the acoustic context that it considers.
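
The threshold-based end-of-phone hypothesis described in this paragraph can be sketched as follows. The per-frame probabilities and the threshold value are made-up illustrations, and the name `hypothesize_phone_end` is hypothetical; a practical system would likely work with log probabilities to avoid underflow.

```python
def hypothesize_phone_end(frame_probs, threshold):
    """Return the index of the first frame at which the cumulative product of
    P(acoustic/phone) falls below the threshold, or None if it never does."""
    cumulative = 1.0
    for t, p in enumerate(frame_probs):
        cumulative *= p
        if cumulative < threshold:
            return t  # the phone is hypothesized to have ended here
    return None

# Illustrative per-frame probabilities for one phone model.
print(hypothesize_phone_end([0.9, 0.8, 0.7, 0.1], threshold=0.1))
```

Note that every frame up to the returned index contributes to the phone score, including frames that may belong to the next phone, which is the weakness the paragraph points out.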

[0025] In some speech recognition systems, "the match" is carried out in two stages. The first stage of the decoder provides a short list of candidate words, out of the 20K vocabulary. Subsequently, detailed models of the words in this short list are used to match the word to the acoustic signal, and the word with the highest score is chosen. The process for determining the short list, called the fast match (see U.S. Patent 5263117 titled "Method and Apparatus for Finding the Best Splits in a Decision Tree for a Language Model"), organizes the phonetic baseforms of the words in the vocabulary in the form of a tree, and traverses down this tree, computing a score for each node, and discarding paths that have scores below a certain threshold. A path comprises a sequence of phones, and often the score for several phones has to be computed before a decision can be made to discard the path. In an earlier invention ("Channel-Bank-Based Thresholding to Improve Search Time in the Fast Match", IBM TDB pp. 113-114, vol. 37, No. 02A, Feb. 1994), a method was described whereby, by observing the output of a channel-bank, a poor path could be discarded at a very early stage, thus saving the cost of computing the scores for the remaining phones on the path. In that earlier publication, the channel-bank outputs were computed in a "blind" fashion, as no information was available about the start and end times of a phone in the acoustic label sequence. In this invention, we describe a method of computing the channel-bank outputs in a more intelligent fashion that results in a reduction in the overall error rate and also reduces the computation time of the fast match.

[0026] Correspondingly, there is proposed a method as set out in claims 1 and 4, and an apparatus as set out in claims 6 and 9.

[0027] This invention proposes an alternative technique to predict phone boundaries that enables the use of an extended acoustic context to predict whether the current time is a phone boundary. The invention uses a non-linear decision-tree-based approach to solve the problem. The quantized feature vectors at, and in the vicinity of, the current time are used to predict the probability of the current time being a phone boundary, with the mechanism of prediction being a decision tree. The decision tree is constructed from training data by designing binary questions about the predictors such that the uncertainty in the predicted class is minimized by asking the question. The size of the class alphabet here is 2, and the technique of [L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, "Classification and Regression Trees", Wadsworth, Inc., 1984] is used to design the questions for each predictor.

[0028] The invention also describes a technique that can be used to further cut down the search space of the speech recognition system. Assuming that the phone boundaries are known, it is possible to compute the score for all phones in the segment between two phone boundaries, and compute the rank of the correct phone in this segment. Ideally, of course, the correct phone should be ranked first, and it should be possible to eliminate all phones other than the topmost phone from the search space. However, in reality, due to ambiguities in the acoustic modelling, the vector-quantized acoustic feature vectors in the segment may not be representative of the sound or phone which was actually uttered in the segment.

[0029] Consequently, the rank of the correct phone could be quite poor in certain segments. 

[0030] The invention also describes a decision-tree-based technique to predict the worst case rank of the correct phone between two hypothesized phone boundaries. Once this worst case rank is known, all the phones that are ranked below the worst case rank are eliminated from the search space of the recognizer, resulting in a large saving in computation. Note that the technique is independent of the method used to compute the score for a phone; typical schemes are (a) the usual Markov-model-based computation, (b) a channel-bank-based computation as described in ["Channel-Bank-Based Thresholding to Improve Search Time in the Fast Match", IBM TDB pp. 113-114, vol. 37, No. 02A, Feb. 1994], and (c) a decision-tree-based scoring mechanism, as described in [co-pending US Patent Application, D. Nahamoo, M. Padmanabhan, M.A. Picheny, P.S. Gopalakrishnan, "A Decision Tree Based pruning strategy for the Acoustic Fast Match", IBM Attorney Docket YO 996-059], or any alternative scoring mechanism.
[0031] The predictors used in the decision tree are, as before, the quantized acoustic feature vectors at, and in the vicinity of, the current time, and the predicted quantity is the worst case rank of the correct phone at the current time. The decision tree is constructed from training data by designing binary questions about the predictors, which are asked while traversing down the nodes of the decision tree. The questions are designed to minimize the uncertainty in the predicted class. Unlike the previous case of boundary estimation, however, the size of the class alphabet is equal to the number of phones, which is typically much larger than 2, and the technique outlined in ["Method and Apparatus for Finding the Best Splits in a Decision Tree for a Language Model for a Speech Recognizer", U.S. Patent 5263117] is used to design the questions for each node.

[0032] The objective of the invention is to take the given vector-quantized feature vectors at the current time t, and the adjacent N time frames on either side, and devise two decision trees. The first decision tree should give the probability of the current frame being a phone boundary, and the second decision tree should give a distribution over all possible ranks that the correct phone can take at that time, from which the worst case rank of the current phone can be obtained.

[0033] A decision tree having true or false (i.e., binary) questions at each node and a probability distribution at each leaf is constructed. Commencing at the root of the tree, by answering a question at each node encountered and then following a first or second branch from the node depending upon whether the answer is "true" or "false", progress is made toward a leaf. The question at each node is phrased in terms of the available data (e.g., the words already spoken) and is designed to ensure that the probability distributions at the leaves provide as much information as possible about the quantity being predicted.

[0034] A principal object of the invention is, therefore, the provision of a method of designing and constructing a binary decision tree having true or false questions at each node starting from the root of the tree towards a leaf.
[0035] Another object of the invention is the provision of a method of constructing a binary decision tree using questions phrased in terms of the available known data and designed to ensure that the probability distributions at the leaves maximize the information about the quantity being predicted.

[0036] A further object of the invention is the provision of a method of constructing a binary decision tree primarily for use in speech pattern recognition.
[0037] Further and still other objects of the invention will become more clearly apparent when the following description is read in conjunction with the accompanying drawings.
[0038] The invention incorporates the following features:

a) The boundary points of phones in the acoustic label sequence are estimated by using a decision tree and the adjoining labels, i.e., in the context of the labels adjacent on both sides of the current label, a decision is made as to whether the current label represents the boundary point between two phones. In the remainder of this disclosure, the term "segment" will be used to denote the time interval between two boundary points.

b) Based only on the labels in a segment, a score is computed for all possible phones, based on the probabilities obtained from the decision tree described in ("Channel-Bank-Based Thresholding to Improve Search Time in the Fast Match", IBM TDB pp. 113-114, vol. 37, No. 02A, Feb. 1994). As mentioned earlier, alternative scoring mechanisms could be used to compute the score for a phone. The phones are next ranked according to their scores.

c) A decision is made that all phones above a certain rank are "good" phones that are possible in the time segment of interest, and that the phones below this threshold rank are "bad" phones that are not possible in the segment of interest. The threshold rank is not fixed but is a function of the label sequence in the current segment and the adjacent segment, and is obtained by using a decision tree. The decision is made on the basis of the label at the start of the segment and the adjacent labels on either side of this label.

d) To avoid errors due to the pruning, the number of candidate phones is now increased by using phone classes, i.e., from training data a list is made, for each phone, of the phones that are confusable with it. When decoding, for every "good" phone obtained in step (c), all phones in the confusion class of the "good" phone are also designated as "good" phones.

e) An alternative to eliminating all "bad" phones from the search is to penalize the score for these bad phones in all subsequent computations in the fast match. All this is precomputed before the actual fast match.


[0039] The implementation of the algorithm in the decoder takes the following steps:

Given a sequence of labels, the following precomputation is done before the fast match: the phone probabilities are first computed from a decision tree as described in ("Channel-Bank-Based Thresholding to Improve Search Time in the Fast Match", IBM TDB pp. 113-114, vol. 37, No. 02A, Feb. 1994). Subsequently, the boundary points of phones in the acoustic label sequence are determined by using the first decision tree described above, and the ranks of different phones are computed within all segments, based on the probabilities obtained from the decision tree of the same publication. Then the threshold rank that should be applied in every segment is obtained by traversing down the second decision tree described above. The phones ranked above the threshold, and the phones in the union of their confusion classes, are then designated as "good" phones, and the remainder as "bad" phones. The probabilities for the "bad" phones in the given segment are then penalized. This penalization is done both on the phone probabilities obtained from the decision tree of the same publication, and on the acoustic fast match probabilities.
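
The good/bad designation and penalization for a single segment can be sketched as below. The multiplicative penalty factor, the phone names, and the helper names `designate_good_phones` and `penalize_bad_phones` are assumptions for illustration; the specification does not say how the penalty is applied.

```python
def designate_good_phones(scores, threshold_rank, confusion_classes):
    """Phones ranked within the threshold, plus the union of their
    confusion classes, are designated "good"."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    good = set(ranked[:threshold_rank])
    for phone in list(good):
        good |= confusion_classes.get(phone, set())
    return good

def penalize_bad_phones(scores, good, penalty):
    """Scale the score of every "bad" phone by an assumed penalty factor."""
    return {p: (s if p in good else s * penalty) for p, s in scores.items()}

# Hypothetical phone probabilities for one segment.
scores = {"AA": 0.5, "AE": 0.25, "S": 0.15, "T": 0.10}
confusion_classes = {"AA": {"AE"}}  # AE is confusable with AA
good = designate_good_phones(scores, threshold_rank=1,
                             confusion_classes=confusion_classes)
penalized = penalize_bad_phones(scores, good, penalty=0.5)
print(sorted(good), penalized)
```

Penalizing rather than removing the "bad" phones keeps them in the search at reduced score, which corresponds to the alternative mentioned in feature (e).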

[0040] Subsequently, the fast match tree is pruned using the modified probabilities above, using the techniques described in ("Channel-Bank-Based Thresholding to Improve Search Time in the Fast Match", IBM TDB pp. 113-114, vol. 37, No. 02A, Feb. 1994; "Transform Representation of the Spectra of Acoustic Speech Segments with Applications - I: General Approach and Speech Recognition", IEEE Transactions on Speech and Audio Processing, pp. 180-195, vol. 1, no. 2, April 1993).

[0041] Hence, the training data used for the construction of the decision tree consists of sets of records of 2N+1 predictors (denoted by the indices -N, ..., 0, ..., N) and the class associated with index 0 (which is assumed to be known). The associated class, in the case of the first decision tree, is a binary record that specifies whether or not the frame at index 0 is a phone boundary. The associated class, in the case of the second decision tree, is the rank of the correct phone at index 0. The alphabet size of each predictor is in the hundreds, and the class alphabet size is either 2 in the case of the first decision tree, or typically 50 or so in the case of the second decision tree. The invention uses the technique described below to construct the two decision trees (note that the two trees are constructed independently of one another).
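
A minimal sketch of assembling such training records from a label sequence follows. Skipping the first and last N frames, which lack a full context window, is an assumption (padding would be another option), and `make_records` is a hypothetical name.

```python
def make_records(labels, classes, n):
    """Build (predictors, class) records: the 2N+1 labels centred on frame t,
    paired with the known class at frame t (index 0 of the window)."""
    records = []
    # Frames without a full window on both sides are skipped (assumption).
    for t in range(n, len(labels) - n):
        predictors = tuple(labels[t - n : t + n + 1])
        records.append((predictors, classes[t]))
    return records

labels = [3, 7, 7, 2, 9]      # label index per frame
boundary = [0, 0, 1, 0, 0]    # 1 if the frame is a phone boundary
print(make_records(labels, boundary, n=1))
```

For the second tree the same records would be built with the rank of the correct phone in place of the binary boundary flag.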

[0042] The invention uses a successive data partitioning and search strategy to determine the questions of the decision tree. Starting with all the training data at the root of the tree, the invention chooses one of the 2N+1 predictors and partitions the alphabet of the predictor into two non-overlapping sets. Subsequently, for all the training records at the current node, if the value of the chosen predictor lies in the first set, the record is assigned to the first set; otherwise it is assigned to the second set. Hence, the training data at the current node is distributed between two child nodes on the basis of the set membership of the selected predictor. The predictor and the partitioning of the alphabet are chosen in such a way that after the training data is partitioned as described above, the uncertainty in the predicted class is minimized. The procedure is repeated for each child of the current node, till the class uncertainty at a node (quantified by the entropy of the class distribution at the node) falls below a certain level, or till the amount of training data at a node falls below a certain level. After the tree is constructed, the class distribution at the terminal nodes of the tree is available, and is stored along with the questions of the tree.
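
To make the choice of predictor and partition concrete, the sketch below searches every predictor and every two-way split of its observed alphabet and keeps the split with the lowest weighted class entropy. This exhaustive search is only a stand-in that is feasible for toy alphabets; it is not the question-design technique of the cited references, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(classes):
    """Entropy (in bits) of the empirical class distribution."""
    total = len(classes)
    counts = Counter(classes)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def split_entropy(records, index, subset):
    """Weighted class entropy after splitting on predictor `index`."""
    left = [c for preds, c in records if preds[index] in subset]
    right = [c for preds, c in records if preds[index] not in subset]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(records)

def best_question(records):
    """Return (entropy, predictor index, first set) of the best binary question."""
    best = None
    for index in range(len(records[0][0])):
        alphabet = sorted({preds[index] for preds, _ in records})
        for size in range(1, len(alphabet)):
            for subset in combinations(alphabet, size):
                h = split_entropy(records, index, set(subset))
                if best is None or h < best[0]:
                    best = (h, index, set(subset))
    return best

# Toy records: ((predictor 0, predictor 1), class)
records = [((1, 5), 0), ((1, 6), 0), ((2, 5), 1), ((2, 6), 1)]
print(best_question(records))
```

In this toy data the class is fully determined by predictor 0, so the chosen question is whether predictor 0 lies in the set {1}. Applying `best_question` recursively to each child's records, until the entropy or the record count falls below a chosen level, would grow the tree.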

[0043] For the case of the first decision tree, the stored quantity is simply the probability that the node is a phone boundary. For the case of the second decision tree, the quantity available at the nodes of the tree is a distribution over all possible ranks that the correct phone can take. This distribution is converted to a single number, a worst case rank, such that the probability that the rank of the correct phone is better than the worst case rank is stored at the node of the decision tree.
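
One possible reading of this conversion is sketched below: the worst case rank is taken as the smallest rank whose cumulative probability reaches a chosen coverage level. The coverage parameter and its values are assumptions introduced for illustration (the text does not give a number), and `worst_case_rank` is a hypothetical name.

```python
def worst_case_rank(rank_distribution, coverage):
    """Smallest rank r such that P(rank of correct phone <= r) >= coverage."""
    cumulative = 0.0
    for rank in sorted(rank_distribution):
        cumulative += rank_distribution[rank]
        if cumulative >= coverage:
            return rank
    return max(rank_distribution)  # fall back to the worst observed rank

# Hypothetical leaf distribution over ranks of the correct phone.
leaf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
print(worst_case_rank(leaf, coverage=0.75))
```

A higher coverage level gives a larger worst case rank, so fewer phones are pruned but the correct phone is less likely to be lost.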

[0044] For the case of a single predictor and a class, Nadas and Nahamoo [U.S. Patent 5263117] describe a technique to find the best binary question that minimizes the uncertainty in the predicted class. At the current node, this technique is applied independently to each of the 2N+1 predictors, and the best question for each predictor is determined. Subsequently, the best one among the 2N+1 predictors is determined as the one that provides the maximum reduction in class uncertainty, and the question at the current node is formulated as the best question for this predictor. Alternatively, the question at a node could also be made more complex, such that it depends on more than one predictor, or an inventory of fixed complex questions could be used, and the best question chosen as the one in this inventory that provides the maximum reduction in class uncertainty.

[0045] It is another object of the invention to describe means whereby the above described decision tree can be used in a speech recognizer. During recognition, the first decision tree is traversed till it reaches one of the terminal nodes, and the probability of the current time being a phone boundary is obtained from the terminal node of the decision tree. This is compared to a predetermined threshold, and if it is larger than the threshold, the current time is hypothesized to be a boundary point. Subsequently, the second decision tree is traversed for all time frames between two hypothesized phone boundaries, and the worst case rank of the correct phone is obtained from the terminal node of the decision tree, for all these time frames. The worst of these worst case ranks is taken to be the worst case rank of the correct phone in that segment. Subsequently, the score for all phones is computed on the basis of that segment, and the phones are ranked according to their scores. Then the phones that are ranked below the worst case rank are discarded from the search, thus making up a shortlist of allowed phones for every segment between two hypothesized phone boundaries. This list may also be augmented further by considering phones that are confusable with each other, and by including every element of a "confusable" list in the short list whenever any one element in the confusable list is ranked above the worst case rank.
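
The recognition-time use of the two trees can be sketched as three small steps, taking the per-frame tree outputs as given. The numeric values, the half-open segment convention (start frame included, end frame excluded), and the function names are assumptions for illustration.

```python
def find_boundaries(boundary_probs, threshold):
    """Frames whose boundary probability exceeds the threshold."""
    return [t for t, p in enumerate(boundary_probs) if p > threshold]

def segment_worst_rank(frame_worst_ranks, start, end):
    """Worst (largest) of the per-frame worst case ranks in [start, end)."""
    return max(frame_worst_ranks[start:end])

def shortlist(segment_scores, worst_rank):
    """Phones ranked within the worst case rank, best first."""
    ranked = sorted(segment_scores, key=segment_scores.get, reverse=True)
    return ranked[:worst_rank]

# Hypothetical outputs of the first and second decision trees per frame.
boundary_probs = [0.9, 0.1, 0.2, 0.8, 0.1]
frame_worst_ranks = [2, 3, 1, 2, 2]
boundaries = find_boundaries(boundary_probs, threshold=0.5)
rank = segment_worst_rank(frame_worst_ranks, boundaries[0], boundaries[1])
scores = {"AA": 0.4, "AE": 0.3, "S": 0.2, "T": 0.1}  # scores for the segment
print(boundaries, rank, shortlist(scores, rank))
```

The shortlist could then be augmented with confusable phones before it is used to constrain the match.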

[0046] This information is used in the maximum likelihood framework to determine whether to carry out a match for
a given word, by constraining the search space of the recognizer to the shortlist, rather than the space of the entire
alphabet. Before carrying out the match for a given phone in a word, the above defined shortlist is checked to see if
the phone can possibly occur at the given time, and if the phone does not occur in the shortlist, then the match for the
current word is discarded.
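By way of illustration only, the shortlist check described above might be sketched as follows. The function name `word_survives`, the representation of a word as (phone, segment index) pairs, and the sample phones are assumptions made for this example and are not taken from the specification.

```python
from typing import Dict, List, Set, Tuple


def word_survives(
    aligned_phones: List[Tuple[str, int]],
    shortlists: Dict[int, Set[str]],
) -> bool:
    """Return True only if every phone of the word is allowed in its segment.

    aligned_phones: hypothetical (phone, segment_index) pairs for one word.
    shortlists: segment_index -> set of phones allowed in that segment.
    """
    for phone, segment in aligned_phones:
        if phone not in shortlists.get(segment, set()):
            return False  # the match for the current word is discarded
    return True


# Assumed shortlists for three consecutive segments.
shortlists = {0: {"K", "G"}, 1: {"AE", "EH"}, 2: {"T", "D"}}
cat = [("K", 0), ("AE", 1), ("T", 2)]
bit = [("B", 0), ("IH", 1), ("T", 2)]
```

Here `word_survives(cat, shortlists)` keeps the word, while `word_survives(bit, shortlists)` rejects it at the first phone, so no further matching effort is spent on that word.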

[0047] The method and apparatus according to the invention are advantageous because (a) they provide a fast and
accurate way of estimating phone boundaries, by enabling the match for a phone to be done within well defined
boundaries, thus leading to better accuracy, and (b) they provide a fast and accurate means of estimating the rank
boundaries of the correct phone without requiring any knowledge about the identity of the correct phone, and thus
enable the creation of a shortlist of allowed phones, which helps in greatly cutting down the search space of the
speech recognizer. Further, the overhead associated with traversing the two decision trees is negligible, as the
questions asked in the decision tree simply involve the set membership of the selected predictor.

Fig. 1 is an illustration of phonetic baseforms for two words;

Fig. 2 is a schematic representation of a Markov model for a phone;

Fig. 3 shows a partial sample of a table representing a statistical Markov model trained by numerous utterances.

Fig. 4 is a flow chart describing a procedure for constructing a decision tree to predict the probability distribution
of a class at a given time, in accordance with the invention.

Fig. 5 is a schematic for constructing a decision tree.

Fig. 6 is a flow chart of an automatic speech recognition system using two decision trees.

Fig. 7 is a flow chart of an automatic speech recognition system using two decision trees.

[0048] Figure 4 is a flow chart depicting the procedure to construct a decision tree to predict a probability distribution
on the class values at time t, given the quantized feature vectors at times t-N, t-N+1, ..., t, ..., t+N. For the purpose of
explaining the working of the invention, the quantized feature vectors will henceforth be referred to as labels. The
predictors used in the decision tree are the labels at times t-N, ..., t, ..., t+N, represented as
$l^{-N}, \ldots, l^{0}, \ldots, l^{+N}$, and the predicted quantity is either a distribution over two classes as in the
case of the boundary-detection decision tree, i.e., the probability that the time t is a phone boundary, or a distribution
over all possible ranks of the correct phone at time t, as in the case of the rank-determining decision tree. The size of
the class alphabet in the second case is equal to the size of the phone alphabet, denoted as P. The size of the label
alphabet is denoted as L. Typically, P ranges from 50 to 100, and L is in the 100's; however, for the purpose of
explaining the invention, we will assume that L=4, P=3, and N=1. We will represent these 4 predictor values as
$l_1$, $l_2$, $l_3$, and $l_4$, and the 3 class values as $p_1$, $p_2$, and $p_3$. The technique described
below uses the procedure of [1] to determine the binary partitioning of the predictor alphabet at a node of the decision
tree, which is appropriate for the case of the rank-determining decision tree, where the number of classes is larger
than 2. However, for the boundary-detection decision tree, where the number of classes is equal to 2, [U.S. Patent
5263117 titled "Method and Apparatus for Finding the Best Splits in a Decision Tree for a Language Model"] reduces
to the simpler optimal strategy of [L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, "Classification and Regression
Trees", Wadsworth, Inc., 1984].

[0049] The training data consists of a number of transcribed sentences, with the acoustics corresponding to each
sentence being quantized into a sequence of labels. Further, as the data is transcribed, it is also possible to assign a
class value to every time frame.

[0050] If the event $(l^k_i, p_j)$ is defined as one where the value of the predictor $l^k$ is equal to $l_i$, and the
class value is equal to $p_j$, then a confusion matrix is next created (Block 2), which enumerates the counts of all
possible events $(l^k_i, p_j)$. The matrix has L rows and P columns, and the entry corresponding to the i-th row and
the j-th column represents the number of times the value of the predictor $l^k$ equalled $l_i$ when the class value
equalled $p_j$, in the training data at the current node of the decision tree (at the root node, all the training data is
used). These counts are then converted into joint probabilities by computing the sum of all entries in the matrix, and
then dividing each entry of the matrix by this sum. As there are 2N+1 predictors, 2N+1 joint distribution matrices can be
created, one for each predictor. An example of these joint distribution matrices is shown in Table 2 below, for the case
of 3 predictors $l^{-1}$, $l^{0}$, and $l^{+1}$.



TABLE 2

$l^{-1}$    $p_1$    $p_2$    $p_3$
$l_1$       0.1      0.067    0.033
$l_2$       0.067    0.167    0.033
$l_3$       0.133    0.033    0.1
$l_4$       0.033    0.067    0.167

$l^{0}$     $p_1$    $p_2$    $p_3$
$l_1$       0.133    0.05     0.033
$l_2$       0.067    0.2      0.034
$l_3$       0.1      0.034    0.067
$l_4$       0.033    0.05     0.2

$l^{+1}$    $p_1$    $p_2$    $p_3$
$l_1$       0.117    0.05     0.033
$l_2$       0.067    0.167    0.033
$l_3$       0.116    0.05     0.1
$l_4$       0.033    0.067    0.167
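As a minimal sketch of how such a joint distribution matrix could be obtained from transcribed, labelled frames, the counts of the events may be accumulated and then normalized by their sum. The function name `joint_distribution` and the toy event list are assumptions for illustration; the sketch also assumes the event list is non-empty.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def joint_distribution(
    events: Sequence[Tuple[int, int]], num_labels: int, num_classes: int
) -> List[List[float]]:
    """Convert (label index i, class index j) events into an L x P joint matrix."""
    counts = Counter(events)            # confusion-matrix counts
    total = sum(counts.values())        # sum of all entries in the matrix
    return [
        [counts[(i, j)] / total for j in range(num_classes)]
        for i in range(num_labels)
    ]


# Toy data for one predictor: one (label index, class index) pair per time frame.
events = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (2, 0), (3, 2), (3, 2)]
matrix = joint_distribution(events, num_labels=4, num_classes=3)
```

One such matrix would be built per predictor, i.e. 2N+1 matrices at each node, using only the training records that reach that node.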

[0051] The class distribution at the current node and its entropy are computed and stored at this point. The class
distribution is obtained by summing up the rows of any one of the 2N+1 joint distribution matrices, i.e.

$$\Pr(p = p_j) = \sum_{i=1}^{4} \Pr(l^k = l_i,\; p = p_j),$$

and the entropy of the class distribution is obtained as

$$H(p) = -\sum_{j=1}^{3} \Pr(p = p_j) \log \left[ \Pr(p = p_j) \right].$$

[0052] For the considered example, the class distribution and its entropy are given in Table 3. The log in H(p) is base 2.

TABLE 3

        $p_1$    $p_2$    $p_3$
Pr      0.333    0.334    0.333

H(p) = 1.58
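The figures of Table 3 can be checked with a short computation. The sketch below, in which the names `JOINT_PREV`, `class_distribution` and `entropy` are chosen here for illustration, sums the joint matrix of the predictor $l^{-1}$ from Table 2 over the predictor values and evaluates the base-2 entropy of the result.

```python
import math

# Joint distribution Pr(l^{-1} = l_i, p = p_j) from Table 2
# (rows l_1..l_4, columns p_1..p_3).
JOINT_PREV = [
    [0.100, 0.067, 0.033],
    [0.067, 0.167, 0.033],
    [0.133, 0.033, 0.100],
    [0.033, 0.067, 0.167],
]


def class_distribution(joint):
    """Marginalize out the predictor: Pr(p_j) = sum over i of Pr(l_i, p_j)."""
    return [sum(column) for column in zip(*joint)]


def entropy(dist):
    """Base-2 entropy of a probability distribution (zero entries are skipped)."""
    return -sum(q * math.log2(q) for q in dist if q > 0)


dist = class_distribution(JOINT_PREV)
h_parent = entropy(dist)
```

With these inputs `dist` is approximately (0.333, 0.334, 0.333) and `h_parent` is approximately 1.58 bits, in agreement with Table 3.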



[0053] In Block 3, we start with the joint distribution of the k-th predictor, $l^k$, and the class p, and design a binary
partitioning $SL^k$, $\overline{SL}^k$ of the values of the predictor $l^k$ using the method of [U.S. Patent 5263117
referenced above]. In other words, for each predictor, the predictor alphabet $[l_1, l_2, l_3, l_4]$ is partitioned into two
complementary sets, $SL^k$ and $\overline{SL}^k$ (for example, $SL^k = [l_1, l_2]$ and $\overline{SL}^k = [l_3, l_4]$),
with the criterion for the selection of the partition being the minimization of the class uncertainty. The entropy of the
class distribution is used as a measure of the uncertainty. The details of this method are given in [U.S. Patent 5263117].
This process is carried out for each predictor independently. For the considered example, one iteration of the procedure
in [U.S. Patent 5263117, col. 4, line 30 to col. 9, line 25] leads to a nearly optimal partitioning of the different predictors
as follows:



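$$SL^{-1} = [l_1, l_2] \text{ and } \overline{SL}^{-1} = [l_3, l_4];$$

$$SL^{0} = [l_1, l_2, l_3] \text{ and } \overline{SL}^{0} = [l_4];$$

$$SL^{+1} = [l_1, l_2] \text{ and } \overline{SL}^{+1} = [l_3, l_4].$$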
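For a small alphabet such as the L=4 example, the partitioning criterion can be illustrated by exhaustive search. The sketch below is not the iterative procedure of the referenced patent; it simply tries every non-trivial binary partition of the label alphabet and keeps the one with the lowest average class entropy. All function names are assumptions, and the result of an exhaustive search need not coincide with the nearly optimal partition quoted above.

```python
import math
from itertools import combinations

# Joint distribution Pr(l^0 = l_i, p = p_j) from Table 2 (rows l_1..l_4).
JOINT_CUR = [
    [0.133, 0.050, 0.033],
    [0.067, 0.200, 0.034],
    [0.100, 0.034, 0.067],
    [0.033, 0.050, 0.200],
]


def entropy(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)


def average_entropy(joint, subset):
    """Average class entropy of the two child nodes for the split (subset, rest)."""
    rest = [i for i in range(len(joint)) if i not in subset]
    total = 0.0
    for rows in (subset, rest):
        mass = sum(sum(joint[i]) for i in rows)  # probability of this child
        if mass == 0:
            continue
        cond = [sum(joint[i][j] for i in rows) / mass for j in range(len(joint[0]))]
        total += mass * entropy(cond)
    return total


def best_partition(joint):
    """Exhaustive stand-in for the partitioning step (feasible only for small L)."""
    labels = range(len(joint))
    candidates = (
        set(c) for r in range(1, len(joint)) for c in combinations(labels, r)
    )
    return min(candidates, key=lambda s: average_entropy(joint, s))


quoted = average_entropy(JOINT_CUR, {0, 1, 2})  # SL^0 = [l_1, l_2, l_3]
found = best_partition(JOINT_CUR)
```

The number of candidate partitions grows exponentially with the alphabet size, which is why an iterative procedure is preferable when L is in the hundreds.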

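[0054] Now, for each one of the predictors $l^k$, the training data at the current node may be split into two parts based
on the partitioning $SL^k$, $\overline{SL}^k$, and the probability of these two child nodes is given as:

$$\Pr(SL^k) = \sum_{l_i \in SL^k} \sum_{j=1}^{3} \Pr(l^k = l_i,\; p = p_j)$$

and

$$\Pr(\overline{SL}^k) = \sum_{l_i \in \overline{SL}^k} \sum_{j=1}^{3} \Pr(l^k = l_i,\; p = p_j).$$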

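[0055] Further, the class distribution conditioned on the partitioning, at the two child nodes, may be calculated as
follows:

$$\Pr(p = p_j \mid SL^k) = \sum_{l_i \in SL^k} \Pr(l^k = l_i,\; p = p_j) \,/\, \Pr(SL^k)$$

and

$$\Pr(p = p_j \mid \overline{SL}^k) = \sum_{l_i \in \overline{SL}^k} \Pr(l^k = l_i,\; p = p_j) \,/\, \Pr(\overline{SL}^k).$$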

[0056] The entropy for each of these child nodes can be calculated just as for the parent node, and the average
entropy of the two child nodes computed as

$$H^k_{avg} = \Pr(SL^k)\, H(p \mid SL^k) + \Pr(\overline{SL}^k)\, H(p \mid \overline{SL}^k).$$

For the considered example, these quantities are tabulated in Table 4 below.

TABLE 4

                                     $p_1$    $p_2$    $p_3$
$\Pr(p \mid SL^{-1})$                0.358    0.5      0.142
$\Pr(p \mid \overline{SL}^{-1})$     0.312    0.188    0.5

$\Pr(p \mid SL^{0})$                 0.418    0.396    0.187
$\Pr(p \mid \overline{SL}^{0})$      0.117    0.177    0.707

$\Pr(p \mid SL^{+1})$                0.394    0.465    0.141
$\Pr(p \mid \overline{SL}^{+1})$     0.28     0.22     0.5

$\Pr(SL^{-1}) = 0.467$    $\Pr(\overline{SL}^{-1}) = 0.533$
$H(p \mid SL^{-1}) = 1.43$    $H(p \mid \overline{SL}^{-1}) = 1.477$    $H^{-1}_{avg} = 1.455$

$\Pr(SL^{0}) = 0.717$    $\Pr(\overline{SL}^{0}) = 0.283$
$H(p \mid SL^{0}) = 1.508$    $H(p \mid \overline{SL}^{0}) = 1.158$    $H^{0}_{avg} = 1.409$

$\Pr(SL^{+1}) = 0.467$    $\Pr(\overline{SL}^{+1}) = 0.533$
$H(p \mid SL^{+1}) = 1.442$    $H(p \mid \overline{SL}^{+1}) = 1.495$    $H^{+1}_{avg} = 1.470$
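The entries of Table 4 follow mechanically from Table 2 once a partition is fixed. As a hedged illustration, the sketch below recomputes the child-node probabilities, the conditional class distributions, and the average entropy for the predictor $l^{-1}$ with $SL^{-1} = [l_1, l_2]$; the function name `split_statistics` is an assumption of this example.

```python
import math

# Joint distribution Pr(l^{-1} = l_i, p = p_j) from Table 2 (rows l_1..l_4).
JOINT_PREV = [
    [0.100, 0.067, 0.033],
    [0.067, 0.167, 0.033],
    [0.133, 0.033, 0.100],
    [0.033, 0.067, 0.167],
]


def split_statistics(joint, subset):
    """Return [(Pr(child), Pr(p | child), H(p | child))] for both children, and H_avg."""
    num_classes = len(joint[0])
    rest = [i for i in range(len(joint)) if i not in subset]
    stats = []
    for rows in (sorted(subset), rest):
        mass = sum(sum(joint[i]) for i in rows)
        cond = [sum(joint[i][j] for i in rows) / mass for j in range(num_classes)]
        h = -sum(q * math.log2(q) for q in cond if q > 0)
        stats.append((mass, cond, h))
    h_avg = sum(mass * h for mass, _, h in stats)
    return stats, h_avg


# Row indices 0 and 1 stand for l_1 and l_2, i.e. SL^{-1} = [l_1, l_2].
stats, h_avg = split_statistics(JOINT_PREV, {0, 1})
```

Up to rounding in the third decimal place, this gives child probabilities of 0.467 and 0.533, child entropies of about 1.43 and 1.477, and an average entropy of about 1.455, matching the first block of Table 4.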

[0057] In Block 4, the reduction in class uncertainty associated with the best question for each predictor is tabulated,
and the predictor which provides the largest reduction in uncertainty is selected. The reduction in uncertainty due to a
partitioning based on $SL^k$ is computed as $H(p) - H^k_{avg}$. For the considered example, we have H(p) = 1.58,
$H^{-1}_{avg}$ = 1.455, $H^{0}_{avg}$ = 1.409 and $H^{+1}_{avg}$ = 1.470. Hence, the selected predictor is $l^{0}$,
as this gives the maximum reduction in the uncertainty of the predicted class.
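The selection in Block 4 amounts to picking the predictor with the largest difference between the parent entropy and the average child entropy. A minimal sketch using the values of the example follows; the function name `select_predictor` and the use of the time offset k as a dictionary key are assumptions.

```python
# Values taken from Table 3 and Table 4 of the example.
H_PARENT = 1.58
H_AVG = {-1: 1.455, 0: 1.409, +1: 1.470}  # keyed by predictor offset k


def select_predictor(h_parent, h_avg_by_offset):
    """Return (offset k, reduction) for the predictor with the largest entropy reduction."""
    reductions = {k: h_parent - h for k, h in h_avg_by_offset.items()}
    best = max(reductions, key=reductions.get)
    return best, reductions[best]


best_offset, reduction = select_predictor(H_PARENT, H_AVG)
```

For these numbers the selected offset is 0, i.e. the label at the current time, with a reduction of about 0.171 bits.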

[0058] In Block 5, the training data at the current node is partitioned into two parts on the basis of the optimal
partitioning of the selected predictor at the current node.

[0059] Subsequently, depending on the class uncertainty and the amount of training data at a child node, the process
goes back to Block 2, and starts again by recomputing the joint distribution on the basis of only the training data at the
child node. The processing at a child node terminates when the class uncertainty at the child node falls below a specified
threshold, or if the amount of training data at a child node falls below a specified threshold.
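The loop through Blocks 2 to 5 can be sketched as a recursive procedure. The code below is an illustration under stated assumptions rather than the procedure of the specification: the best question is found by exhaustive search over label subsets instead of the method of the referenced patent, the tree is held in plain dictionaries, and the names `grow`, `best_question`, `min_entropy` and `min_records` are invented for the example. Each training record is a pair (tuple of 2N+1 labels, class value).

```python
import math
from collections import Counter
from itertools import combinations


def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def best_question(records, num_predictors, alphabet):
    """Return (average entropy, predictor index, label subset) of the best split, or None."""
    best = None
    for k in range(num_predictors):
        for r in range(1, len(alphabet)):
            for subset in map(frozenset, combinations(alphabet, r)):
                yes = [rec for rec in records if rec[0][k] in subset]
                no = [rec for rec in records if rec[0][k] not in subset]
                if not yes or not no:
                    continue
                h = sum(
                    len(part) / len(records) * entropy(Counter(c for _, c in part))
                    for part in (yes, no)
                )
                if best is None or h < best[0]:
                    best = (h, k, subset)
    return best


def grow(records, num_predictors, alphabet, min_entropy=0.1, min_records=4):
    classes = Counter(c for _, c in records)
    node = {"dist": {c: n / len(records) for c, n in classes.items()}}
    # Stop when the class uncertainty or the amount of data falls below a threshold.
    if entropy(classes) < min_entropy or len(records) < min_records:
        return node
    question = best_question(records, num_predictors, alphabet)
    if question is None:
        return node
    _, k, subset = question
    yes = [r for r in records if r[0][k] in subset]
    no = [r for r in records if r[0][k] not in subset]
    node.update(predictor=k, subset=subset,
                yes=grow(yes, num_predictors, alphabet, min_entropy, min_records),
                no=grow(no, num_predictors, alphabet, min_entropy, min_records))
    return node


# Toy data: the class is 1 exactly when the centre label is 0 or 1.
records = [((a, b, c), int(b in (0, 1)))
           for a in range(4) for b in range(4) for c in range(4)]
tree = grow(records, num_predictors=3, alphabet=range(4))
```

On this toy data the root asks whether the centre label lies in {0, 1}, and both child nodes are terminal because their class entropy is zero.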
[0060] Fig. 5 schematically shows an apparatus for constructing the decision tree. The apparatus may comprise,
for example, an appropriately programmed computer system. In this example, the apparatus comprises a general
purpose digital processor 8 having a data entry keyboard 9, a display 10, a random access memory 11, and a storage
device 12. From the training data, processor 8 computes the joint distribution of the predictor $l^k$ and the class value p,
for the first decision tree, for all 2N+1 predictors, using all of the training data, and stores the estimated joint distribution,
along with the class distribution, in storage device 12.

[0061] Next, processor 8 computes the best partitioning of each of the predictor values such that the maximum
reduction in class uncertainty is obtained due to the partitioning, according to the algorithm of [U.S. Patent 5263117].
Then processor 8 chooses the best predictor, $l^*$, and partitions the training data into two child nodes based on the best
partitioning for the predictor $l^*$.

[0062] Still under the control of the program, the processor 8 repeats the above procedure for the data at each of
the two child nodes, till the class entropy at the node falls below a specified threshold, or till the amount of training data
at a node falls below a specified threshold.

[0063] After the decision tree is grown, still under control of the program, the processor computes a distribution on
class values for every node of the decision tree, and stores it in storage device 12. The above process is then repeated
to construct the second decision tree. For the case of the second decision tree, the probability distribution over all
possible ranks, which is stored at every node of the tree, is converted into a single number, the worst case rank of the
correct phone, by choosing the worst case rank as the class value at which the cumulative probability distribution of
the classes exceeds a specified threshold.
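The conversion of a stored rank distribution into a single worst case rank can be illustrated as follows. The function name `worst_case_rank`, the list layout (index r-1 holds the probability that the correct phone has rank r), and the threshold values are assumptions for this sketch.

```python
from itertools import accumulate


def worst_case_rank(rank_probs, threshold=0.99):
    """Smallest rank at which the cumulative rank probability exceeds the threshold.

    rank_probs[r - 1] is the probability that the correct phone has rank r.
    """
    for rank, cumulative in enumerate(accumulate(rank_probs), start=1):
        if cumulative > threshold:
            return rank
    return len(rank_probs)  # threshold never exceeded: allow every rank


# Assumed distribution at one terminal node of the second decision tree.
leaf_distribution = [0.60, 0.25, 0.10, 0.04, 0.01]
```

With a threshold of 0.9, `worst_case_rank(leaf_distribution, 0.9)` yields 3, since the cumulative probabilities 0.60 and 0.85 do not exceed the threshold but 0.95 does. A higher threshold gives a more conservative (larger) worst case rank and hence a longer shortlist.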

[0064] Fig. 6 is a block diagram of an automatic speech recognition system which utilizes the decision tree according
to the present invention. The system in Fig. 6 includes a microphone 13 for converting an utterance into an electrical
signal. The signal from the microphone is processed by an acoustic processor and label match 14 which finds the
best-matched acoustic label prototype from the acoustic label prototype store 15. A probability distribution on phone
boundaries 16a is then produced for every time frame using the first decision tree 17a described in the invention. These
probabilities are compared to a threshold and some time frames are identified as boundaries between phones.
Subsequently, an acoustic score is computed 16b, for all phones between every given pair of hypothesized boundaries,
and the phones are ranked on the basis of this score. Note that this score may be computed in any fashion, with the
only constraint being that the score is computed using the same technique as was used when constructing the second
decision tree. Subsequently, the second decision tree 17b is traversed for every time frame to obtain the worst case
rank of the correct phone at that time, and using the phone score and phone rank computed in 16b, a shortlist of allowed
phones 16c is made up for every time frame. This information is used to select a subset of acoustic word models in
store 19, and a fast acoustic word match processor 18 matches the label string from the acoustic processor 14 against
this subset of abridged acoustic word models to produce an output signal.

[0065] The output of the fast acoustic word match processor comprises at least one word. In general, however,
the fast acoustic word match processor will output a number of candidate words.

[0066] Each word produced by the fast acoustic word match processor 18 is input into a word context match 20
which compares the word context to language models in store 21 and outputs at least one candidate word. From the
recognition candidates produced by the fast acoustic match and the language model, the detailed acoustic match 22
matches the label string from the acoustic processor 14 against detailed acoustic word models in store 23 and outputs
a word string corresponding to an utterance.

[0067] Fig. 7 describes Blocks 16a-c and 17a-b in further detail. Given the acoustic label string from the acoustic
processor 14, the context-dependent boundary estimation process 16 traverses the first decision tree 17a for every
time frame using the label at the current time and the labels at the adjacent times as the predictors, until it reaches a
terminal node of the tree. Then the probability that the current time is a phone boundary is picked up from the stored
class distribution at the leaf, and compared to a threshold. If the probability is larger than the threshold, it is hypothesized
that the current time is a phone boundary.
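A hedged sketch of this traversal is given below. The dictionary layout of the tree, the key names, the toy tree `TREE`, and the label values are assumptions made for illustration; each internal node asks whether the label of one predictor belongs to a stored set, as the set-membership questions of the specification do.

```python
def boundary_probability(tree, labels):
    """Walk from the root to a terminal node; labels[k] is the k-th predictor."""
    node = tree
    while "subset" in node:  # internal node: set-membership question
        branch = "yes" if labels[node["predictor"]] in node["subset"] else "no"
        node = node[branch]
    return node["p_boundary"]


def hypothesize_boundaries(tree, label_string, threshold=0.5, n=1):
    """Return the time frames whose boundary probability exceeds the threshold."""
    boundaries = []
    for t in range(n, len(label_string) - n):
        window = label_string[t - n : t + n + 1]  # labels at t-n, ..., t, ..., t+n
        if boundary_probability(tree, window) > threshold:
            boundaries.append(t)
    return boundaries


# Toy tree for N=1: predictor 0 is the previous label, predictor 1 the current label.
TREE = {
    "predictor": 1, "subset": {3, 7},
    "yes": {"p_boundary": 0.9},
    "no": {
        "predictor": 0, "subset": {3},
        "yes": {"p_boundary": 0.6},
        "no": {"p_boundary": 0.05},
    },
}

frames = hypothesize_boundaries(TREE, [1, 1, 3, 2, 2, 7, 2])
```

For the label string shown, frames 2, 3 and 5 are hypothesized as phone boundaries. Frames closer than N to either end of the string are skipped here for simplicity, which is a choice of this sketch only.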
[0068] Subsequently, an acoustic score is computed for every phone between every pair of boundary points and the
phones are ranked on the basis of these scores. One of several techniques could be used to compute this score; for
example, the usual Markov based computation could be used, or a channel-bank-based computation as described in
["Channel-Bank-Based Thresholding to Improve Search Time in the Fast Match", IBM TDB pp. 113-114, vol. 37, No.
02A, Feb. 1994] could be used, or a decision-tree-based scoring mechanism, as described in [D. Nahamoo,
M. Padmanabhan, M. A. Picheny, P. S. Gopalkrishnan, "A Decision Tree Based Pruning Strategy for the Acoustic Fast Match",
IBM Attorney Docket No. YO 9i96-059]; the only constraint on the scoring mechanism is that the same mechanism
should be used as was used when obtaining the training records for the second decision tree.
[0069] Subsequently, the second decision tree 17b is traversed for every time frame, using the label at the current
time and at the adjacent times as the predictors, till a terminal node of the tree is reached. The worst case rank of the
correct phone is read from the data stored at this node and taken to be the worst case rank of the correct phone at
this time. Subsequently, the worst of the worst-case ranks between any two adjacent hypothesized phone boundaries
is taken to be the worst case rank of the correct phone in the segment between the phone boundaries. All the phones
whose ranks are worse than this worst case rank are then discarded in the current segment, and a shortlist of allowed
phones is made up for the segment.
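The segment-level pruning can be sketched as below, assuming that a numerically larger rank is a worse rank and that a higher acoustic score is better. The function name `segment_shortlists` and the sample scores are illustrative assumptions, not values from the specification.

```python
def segment_shortlists(boundaries, frame_worst_ranks, segment_scores):
    """Build one shortlist of phones per segment between adjacent boundaries.

    boundaries: sorted frame indices of hypothesized phone boundaries.
    frame_worst_ranks: worst case rank read from the second tree for each frame.
    segment_scores: one dict per segment mapping phone -> acoustic score.
    """
    shortlists = []
    for seg, (start, end) in enumerate(zip(boundaries, boundaries[1:])):
        worst = max(frame_worst_ranks[start:end])  # worst of the worst case ranks
        scores = segment_scores[seg]
        ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
        shortlists.append(ranked[:worst])  # discard phones ranked worse than `worst`
    return shortlists


boundaries = [0, 3, 6]
frame_worst_ranks = [1, 2, 1, 3, 2, 2]
segment_scores = [
    {"AA": -1.0, "AE": -2.5, "EH": -4.0, "K": -9.0},
    {"T": -0.5, "D": -1.5, "K": -2.0, "AA": -8.0},
]
shortlists = segment_shortlists(boundaries, frame_worst_ranks, segment_scores)
```

In this example the first segment keeps its two best phones and the second keeps its three best phones.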
[0070] Now, it is often the case that some phones are very similar and may be easily confused with each other. Lists
of such confusable phones can be made from the training data, and the shortlist described above may be augmented
by adding in these lists of confusable phones. For instance, if the rank of any one element in a list of confusable phones
is better than the worst case rank, the entire set of confusable phones is included in the short list.
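The augmentation step may be illustrated as follows; the function name `augment_shortlist` and the confusable groups are assumptions chosen for the example.

```python
def augment_shortlist(shortlist, confusable_lists):
    """Add every member of a confusable list if any one member is on the shortlist."""
    augmented = list(shortlist)
    for group in confusable_lists:
        if any(phone in shortlist for phone in group):
            augmented.extend(p for p in group if p not in augmented)
    return augmented


# Assumed confusable lists, e.g. derived from confusions seen in training data.
confusable_lists = [["T", "K", "P"], ["M", "N"]]
augmented = augment_shortlist(["T", "D"], confusable_lists)
```

Because "T" survived the rank-based pruning, "K" and "P" are restored to the shortlist, while the group containing "M" and "N" is left out since none of its members is on the shortlist.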
[0071] It should be understood that the foregoing description is only illustrative of the invention. Various alternatives
and modifications can be devised by those skilled in the art without departing from the invention. Accordingly, the
present invention is intended to embrace all such alternatives, modifications and variances which fall within the scope
of the appended claims.

TABLE 1

THE TWO LETTERS ROUGHLY REPRESENT THE SOUND OF THE ELEMENT.

TWO DIGITS ARE ASSOCIATED WITH VOWELS:
FIRST: STRESS OF SOUND
SECOND: CURRENT IDENTIFICATION NUMBER

ONE DIGIT ONLY IS ASSOCIATED WITH CONSONANTS:
SINGLE DIGIT: CURRENT IDENTIFICATION NUMBER


001 AA11    029 BX2-    057 EH02    148 TX5-    176 XX11
002 AA12    030 BX3-    058 EH11    149 TX6-    177 XX12
003 AA13    031 BX4-    059 EH12    150 UH01    178 XX13
004 AA14    032 BX5-    060 EH13    151 UH02    179 XX14
005 AA15    033 BX6-    061 EH14    152 UH11    180 XX15
006 AE11    034 BX7-    062 EH15    153 UH12    181 XX16
007 AE12    035 BX8-    126 RX1-    154 UH13    182 XX17
008 AE13    036 BX9-    127 SH1-    155 UH14    183 XX18
009 AE14    037 DH1-    128 SH2-    156 UU11    184 XX19
010 AE15    038 DH2-    129 SX1-    157 UU12    185 XX2-
011 AW11    039 DQ1-    130 SX2-    158 UXG1    186 XX20
012 AW12    040 DQ2-    131 SX3-    159 UXG2    187 XX21
013 AW13    041 DQ3-    132 SX4-    160 UX11    188 XX22
014 AX11    042 DQ4-    133 SX5-    161 UX12    189 XX23
015 AX12    043 DX1-    134 SX6-    162 UX13    190 XX24
016 AX13    044 DX2-    135 SX7-    163 VX1-    191 XX3-
017 AX14    045 EE01    136 TH1-    164 VX2-    192 XX4-
018 AX15    046 EE02    137 TH2-    165 VX3-    193 XX5-
019 AX16    047 EE11    138 TH3-    166 VX4-    194 XX6-
020 AX17    048 EE12    139 TH4-    167 WX1-    195 XX7-
021 BQ1-    049 EE13    140 TH5-    168 WX2-    196 XX8-
022 BQ2-    050 EE14    141 TQ1-    169 WX3-    197 XX9-
023 BQ3-    051 EE15    142 TQ2-    170 WX4-    198 ZX1-
024 BQ4-    052 EE16    143 TX3-    171 WX5-    199 ZX2-
025 BX1-    053 EE17    144 TX1-    172 WX6-    200 ZX3-
026 BX10    054 EE18    145 TX2-    173 WX7-
027 BX11    055 EE19    146 TX3-    174 XX1-
028 BX12    056 EH01    147 TX4-    175 XX10


Claims 



1. A method of recognizing speech, comprising the steps of:

a) inputting a plurality of words of training data;

b) training one or more binary first decision trees to ask a maximally informative question at each node based
upon contextual information in the training data, wherein each binary first decision tree may correspond to a
different time in a sequence of the training data;

c) traversing one of the decision trees for every time frame of an input sequence of speech to determine a
probability distribution for every time frame, the probability distribution being the probability that a node is a
phone boundary;

d) comparing the probabilities associated with the time frames with a threshold for identifying some time frames
as boundaries between phones;

e) providing an acoustic score for all phones between every given pair of boundaries;

f) ranking the phones on the basis of this score;

g) outputting a recognition result in response to the score.

2. The method of claim 1 further including the steps of:

h) traversing one or more of a second set of decision trees for every time frame on an input sequence of
speech to determine a second probability distribution, the probability distribution being a distribution over all
possible ranks that the correct phone can take for obtaining a worst case rank of a correctly recognized phone
by choosing the worst case rank as the class value at which the cumulative probability distribution of the
classes exceeds a specified threshold;

i) assigning as the absolute worst case rank of the worst case ranks between any two adjacent phone
boundaries the worst case rank of the correctly recognized phone between the phone boundaries;

j) discarding all phones whose rank is worse than this absolute worst case rank in the current segment;

k) making a short list of phones for the segment;

l) outputting a recognition result in response to the short list of the recognition result being a short list of words.

3. The method of claim 1, further including the steps of:

h) traversing one or more of a second set of decision trees for every time frame on an input sequence of
speech to determine a second probability distribution, the probability distribution being a distribution over all
possible ranks that a phone can take for obtaining a worst case rank of a correctly recognized phone by
choosing the worst case rank as the class value at which the cumulative probability distribution of the classes
exceeds a specified threshold;

i) assigning as the absolute worst case rank of the worst case ranks between any two adjacent phone
boundaries the worst case rank of the correctly recognized phone between the phone boundaries;

j) discarding all phone boundaries whose rank is worse than this absolute worst case rank in the current
segment;

k) making a short list of phones for the segment;

l) comparing constituent phones of a word in a vocabulary to see if the word lies in the short list and making
up a short list of words;

m) outputting a recognition result by comparing the words of the short list with a language model to determine
the most probable word match for the input sequence of speech.

4. A method for recognizing speech, comprising the steps of:

a) entering a string of utterances constituting training data;

b) converting the utterances of the training data to electrical signals;

c) representing the electrical signal of the training data as prototype quantized feature vectors, one feature
vector representing a given time frame;

d) assigning to each prototype feature vector a class label associated with the prototype quantized feature
vector;

e) forming one or more binary decision trees for different times in the training data, each tree having a root
node and a plurality of child nodes, comprising the steps of:

i. creating a set of training records comprising 2K+1 predictors, $l^k$, and one predicted class, p, where the
2K+1 predictors are feature vector labels at 2K+1 consecutive times t-K, ..., t, ..., t+K, and the
predicted class is a binary record indicator whether time t is associated with a phone boundary in the case
of the first decision tree or is associated with the correct phone in the case of the second decision tree;

ii. computing the estimated joint distribution of predictors $l^k$ and phone p for 2K+1 predictors using the
training data, wherein the predictors are feature vector labels at times t-K, ..., t, ..., t+K and p is the phone
at time t;

iii. storing the estimated joint distribution of $l^k$ and p and a corresponding distribution for each predictor
$l^k$ at the root node;

iv. computing the best partitioning of the values that predictor $l^k$ can take for each $l^k$ to minimize phone
uncertainty at each node;

v. choosing the predictor $l^k$ whose partitioning results in the lowest uncertainty and partitioning the training
data into two child nodes based on the computed-based partitioning $l^k$, each child node being assigned
a class distribution based on the training data at the child node;

f) repeating for each child node if the amount of training data at the child node is greater than a threshold;

g) inputting an utterance to be recognized;

h) converting the utterance into an electrical signal;

i) representing the electrical signal as a series of quantized feature vectors;

j) matching the series of quantized feature vectors against the stored prototype feature vectors to determine
a closest match and assigning an input label to each of the series of feature vectors corresponding to the label
of the closest matching prototype feature vector;

k) traversing one of the decision trees for every time frame of an input sequence of speech to determine a
probability distribution for every time frame, the probability distribution being the probability that a node is a
phone boundary;

l) comparing the probabilities associated with the time frames with a threshold for identifying some time frames
as boundaries between phones;

m) providing an acoustic score for all phones between every given pair of boundaries;

n) ranking the phones on the basis of this score;

o) outputting a recognition result in response to the score.

5. The method of claim 4 further including the steps of:

traversing one or more of a second set of decision trees for every time frame on an input sequence of speech
to determine a second probability distribution, the probability distribution being a distribution over all possible
ranks that a phone can take for obtaining a worst case rank of a correctly recognized phone by choosing the
worst case rank as the class value at which the cumulative probability distribution of the classes exceeds a
specified threshold;

assigning as the absolute worst case rank of the worst case ranks between any two adjacent phone boundaries
the worst case rank of the correctly recognized phone between the phone boundaries;

discarding all phone boundaries whose rank is worse than this absolute worst case rank in the current segment;

making a short list for the segment;

outputting a recognition result in response to the short list.

6. An apparatus for recognizing speech, comprising:

a) means for inputting a plurality of words of training data;

b) means for training one or more binary first decision trees to ask a maximally informative question at each
node based upon contextual information in the training data, wherein each binary first decision tree may
correspond to a different time in a sequence of the training data;

c) means for traversing one of the decision trees for every time frame of an input sequence of speech to
determine a probability distribution for every time frame, the probability distribution being the probability that
a node is a phone boundary;

d) means for comparing the probabilities associated with the time frames with a threshold for identifying some
time frames as boundaries between phones;

e) means for providing an acoustic score for all phones between every given pair of boundaries;

f) means for ranking the phones on the basis of this score;

g) means for outputting a recognition result in response to the score.

7. The apparatus of claim 6 further including:

h) means for traversing one or more of a second set of decision trees for every time frame on an input sequence
of speech to determine a second probability distribution, the probability distribution being a distribution over
all possible ranks that the correct phone can take for obtaining a worst case rank of a correctly recognized
phone by choosing the worst case rank as the class value at which the cumulative probability distribution of
the classes exceeds a specified threshold;

i) means for assigning as the absolute worst case rank of the worst case ranks between any two adjacent
phone boundaries the worst case rank of the correctly recognized phone between the phone boundaries;

j) means for discarding all phones whose rank is worse than this absolute worst case rank in the current
segment;

k) means for making a short list of phones for the segment;

l) means for outputting a recognition result in response to the short list of the recognition result being a short
list of words.

8. The apparatus of claim 6, further including:

h) means for traversing one or more of a second set of decision trees for every time frame on an input sequence
of speech to determine a second probability distribution, the probability distribution being a distribution over
all possible ranks that a phone can take for obtaining a worst case rank of a correctly recognized phone by
choosing the worst case rank as the class value at which the cumulative probability distribution of the classes
exceeds a specified threshold;

i) means for assigning as the absolute worst case rank of the worst case ranks between any two adjacent
phone boundaries the worst case rank of the correctly recognized phone between the phone boundaries;

j) means for discarding all phone boundaries whose rank is worse than this absolute worst case rank in the
current segment;

k) means for making a short list of phones for the segment;

l) means for comparing constituent phones of a word in a vocabulary to see if the word lies in the short list
and making up a short list of words;

m) means for outputting a recognition result by comparing the words of the short list with a language model
to determine the most probable word match for the input sequence of speech.

An apparatus for recognizing speech, comprising: 

a) means for entering a string of utterances constituting training data; 

b) means for cohverting the utterances of the training data to electrical signals; 

c) means for representing the electrical signal of the training data as prototype quantized feature vectors, one 
feature vector representing a given time frame;. 

d) means for assigning to. ieach prototype feature vector a class label associated with the prototype quantized 
feature vector; 

e) means for forming one or more binary decision trees for different times in the training data, each tree having 
a root node and a plurality of child nodes, comprising the steps of: 

I. means for creating a set of training records comprising 2K+1 predictors, f^, and one predicted class, p, 
where the 2K+1 predictors are feature vector labels at 2K+1 consecutive times t-K, .... t, t, .... t+K, and 
the predicted class is a binary record indicator whether time t is associated with a phone boundary in the 
case of the first decision tree or is associated with the connect phone in the case of the second decision tree; 

II. means for computing the estiimated joint distributton of predtotors and phone p for 2K+1 predictors 
using the training data, wherein the predictors are feature vector labels at times t-K, t, t+K and p is 
the phone at time t; 

lii. means for storing the estimated Joint distribution of l>< and p and a con^esponding distribution for each, 
predictor I** at the root node; 

iv. means for computing the best partittoning of the values that predictor f« can take for each to minimize 
phone uncertainty at each node; 

V. means for choosing the predictor 1*^ whose partitk>ning results in the lowest uncertainty and partitioning 
the training data into two child nodes based on the computed-based partittoning f^, each child node being 
assigned a class distribution based on the training data at the child node; 

f) means for repeating for each child node if the amount of training data at the child node Is greater than a. 
threshold; 

g) means for Inputting an utterance to be recognized; 

h) means for converting the utterance into an electrical signal; 

i) means for representing the electrical signal as a series of quantized feature vectors; 

j) means for matching the series of quantized feature vectors against the stored prototype feature vectors to
determine a closest match and assigning an input label to each of the series of feature vectors corresponding
to the label of the closest matching prototype feature vector;

k) means for traversing one of the decision trees for every time frame of an input sequence of speech to
determine a probability distribution for every time frame, the probability distribution being the probability that
a node is a phone boundary;

l) means for comparing the probabilities associated with the time frames with a threshold for identifying some
time frames as boundaries between phones;

m) means for providing an acoustic score for all phones between every given pair of boundaries; 

n) means for ranking the phones on the basis of this score; 

o) means for outputting a recognition result in response to the score.
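
Elements e) i. to v. above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`make_records`, `best_split`, `choose_predictor`) are hypothetical, "phone uncertainty" is assumed to mean Shannon entropy of the binary class, and the candidate partitions are found by ordering label values by their fraction of positive records and trying every cut along that ordering, which is one common way to search binary partitions when the class is binary.

```python
import math
from collections import Counter, defaultdict

def make_records(labels, targets, K):
    """Pair the 2K+1 labels centred on each time t with the class at time t."""
    return [(tuple(labels[t - K:t + K + 1]), targets[t])
            for t in range(K, len(labels) - K)]

def entropy(counts):
    """Shannon entropy in bits of a Counter of class counts."""
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values() if c)

def best_split(records, k):
    """Return (uncertainty, left_values) for the best binary partition of predictor k.

    Assumption: the class is binary (0/1), so label values are ordered by their
    fraction of class-1 records and every cut along that ordering is tried.
    """
    by_value = defaultdict(Counter)
    for window, y in records:
        by_value[window[k]][y] += 1
    order = sorted(by_value, key=lambda v: by_value[v][1] / sum(by_value[v].values()))
    n = len(records)
    best = (float("inf"), None)
    for cut in range(1, len(order)):
        left, right = Counter(), Counter()
        for v in order[:cut]:
            left.update(by_value[v])
        for v in order[cut:]:
            right.update(by_value[v])
        # Weighted average entropy of the two candidate child nodes.
        h = (sum(left.values()) * entropy(left)
             + sum(right.values()) * entropy(right)) / n
        if h < best[0]:
            best = (h, frozenset(order[:cut]))
    return best

def choose_predictor(records):
    """Return (k, uncertainty, left_values) for the predictor with the lowest uncertainty."""
    width = len(records[0][0])
    return min(((k, *best_split(records, k)) for k in range(width)),
               key=lambda r: r[1])
```

Records whose value of the chosen predictor falls in `left_values` would go to one child node and the remaining records to the other; repeating the procedure at each child while enough training data remains corresponds to element f).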

10. The apparatus of claim 9 further including:

means for traversing one or more of a second set of decision trees for every time frame on an input sequence of
speech to determine a second probability distribution, the probability distribution being a distribution over all
possible ranks that a phone can take for obtaining a worst case rank of a correctly recognized phone by choosing the
worst case rank as the class value at which the cumulative probability distribution of the classes exceeds a specified
threshold;

means for assigning as the absolute worst case rank of the worst case ranks between any two adjacent phone
boundaries the worst case rank of the correctly recognized phone between the phone boundaries;

means for discarding all phone boundaries whose rank is worse than this absolute worst case rank in the current
segment;

means for making a short list for the segment;

means for outputting a recognition result in response to the short list.
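
The worst case rank selection in claim 10 can be sketched as follows. This is an illustrative reading only: the names `worst_case_rank` and `segment_short_list` are hypothetical, the per-frame rank distribution is assumed to be given as a list whose entry i is the probability that the correct phone has rank i+1, and the absolute worst case rank of a segment is assumed to be the largest per-frame worst case rank within that segment.

```python
def worst_case_rank(rank_probs, threshold):
    """Smallest rank at which the cumulative rank probability exceeds the threshold.

    rank_probs[i] is assumed to be P(correct phone has rank i + 1).
    """
    cumulative = 0.0
    for rank, p in enumerate(rank_probs, start=1):
        cumulative += p
        if cumulative > threshold:
            return rank
    return len(rank_probs)  # threshold never exceeded: keep every rank

def segment_short_list(ranked_phones, frame_rank_probs, threshold):
    """Keep the phones ranked no worse than the segment's absolute worst case rank.

    ranked_phones is the best-first phone list for the segment and
    frame_rank_probs holds one rank distribution per frame of the segment.
    """
    absolute_worst = max(worst_case_rank(p, threshold) for p in frame_rank_probs)
    return ranked_phones[:absolute_worst]
```

A higher threshold yields a longer short list and a lower chance of discarding the correct phone; a lower threshold prunes more aggressively.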



Claims (German-language version)

1. A method for speech recognition, comprising the following steps:

a) inputting a plurality of words of training data;

b) training one or more binary first decision trees to ask, at each node, a maximally informative question
based on context data within the training data, wherein each binary first decision tree may correspond
to a different time in a sequence of the training data;

c) traversing one of the decision trees for every time frame of a speech input sequence to determine a
probability distribution for every time frame, the probability distribution being the probability that a node
is a phone boundary;

d) comparing the probabilities of the time frames with a threshold to identify some time frames as boundaries
between phones;

e) providing an acoustic score for all phones between every given pair of boundaries;

f) ranking the phones on the basis of this score;

g) outputting a recognition result in response to this score.
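
Steps c) to f) can be pictured with a short Python sketch. It assumes the decision tree has already produced one boundary probability per time frame and that acoustic scores are available as a mapping from phone to score where a higher score is better; the function names are hypothetical and the claim does not prescribe these data structures.

```python
def detect_boundaries(boundary_probs, threshold):
    """Indices of the time frames whose boundary probability exceeds the threshold."""
    return [t for t, p in enumerate(boundary_probs) if p > threshold]

def segments(boundaries):
    """Pairs of adjacent boundaries; each pair delimits one segment to be scored."""
    return list(zip(boundaries, boundaries[1:]))

def rank_phones(scores):
    """Phones ordered best-first, assuming a higher acoustic score is better."""
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `detect_boundaries([0.1, 0.9, 0.2, 0.3, 0.8, 0.1], 0.5)` marks frames 1 and 4 as boundaries, and `segments` then yields the single segment `(1, 4)` for which the phones would be scored and ranked.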

2. The method according to claim 1, further comprising the following steps:

h) traversing one or more decision trees of a second set of decision trees for every time frame in a speech
input sequence to determine a second probability distribution, the probability distribution being a distribution
over all ranks that the correct phone can take, in order to obtain a worst case rank of a correctly recognized
phone by choosing as the worst case rank the class value at which the cumulative probability distribution
of the classes exceeds a specified threshold;

i) assigning, as the absolute worst case rank among the worst case ranks between any two adjacent phone
boundaries, the worst case rank of the correctly recognized phone between the phone boundaries;

j) discarding all phones whose rank is worse than this absolute worst case rank in the current segment;

k) making a short list of phones for the segment;

l) outputting a recognition result if the short list of the recognition result is a short list of words.

3. The method according to claim 1, further comprising the following steps:

h) traversing one or more decision trees of a second set of decision trees for every time frame of a speech
input sequence to determine a second probability distribution, the probability distribution being a distribution
over all possible ranks that a phone can take, in order to obtain a worst case rank of a correctly recognized
phone by choosing as the worst case rank the class value at which the cumulative probability distribution
of the classes exceeds a specified threshold;

i) assigning, as the absolute worst case rank among the worst case ranks between any two adjacent phone
boundaries, the worst case rank of the correctly recognized phone between the phone boundaries;

j) discarding all phones whose rank is worse than this absolute worst case rank in the current segment;

k) making a short list of phones for the segment;

l) comparing the constituent phones of a word in a vocabulary to determine whether the word lies in the
short list, and making a short list of words;

m) outputting a recognition result by comparing the words of the short list with a language model to
determine the most probable word match for the speech input sequence.
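
The step of comparing the constituent phones of a vocabulary word against the short list can be sketched in Python as below. This is a simplified illustration under stated assumptions: the vocabulary is a hypothetical mapping from each word to its phone sequence, each segment has its own set of short-listed phones, and a word is kept only when it has one phone per segment and every phone appears in the short list of its segment. The language model comparison of the final step is not shown.

```python
def word_short_list(vocabulary, segment_short_lists):
    """Words whose constituent phones all lie in the per-segment phone short lists.

    vocabulary maps a word to its list of phones; segment_short_lists holds one
    set of allowed phones per segment (a one-phone-per-segment alignment is assumed).
    """
    kept = []
    for word, phones in vocabulary.items():
        if len(phones) == len(segment_short_lists) and all(
                phone in allowed
                for phone, allowed in zip(phones, segment_short_lists)):
            kept.append(word)
    return kept
```

Only the surviving words then need to be compared with the language model, which is where the reduction of the search space comes from.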

4. A method for speech recognition, comprising the following steps:

a) inputting a string of utterances constituting training data;

b) converting the utterances of the training data into electrical signals;

c) representing the electrical signal of the training data as prototype quantized feature vectors, one feature
vector representing a given time frame;

d) assigning to each prototype feature vector a class label associated with the prototype quantized feature
vector;

e) building one or more decision trees for different times in the training data, each tree having a root node
and a plurality of child nodes, comprising the following steps:

i. creating a set of training records comprising 2K+1 predictors, l^k, and one predicted class, p, where the
2K+1 predictors are feature vector labels at 2K+1 consecutive times t-K, ..., t, ..., t+K, and the predicted
class is a binary record indicator of whether time t is associated with a phone boundary in the case of the
first decision tree or with the correct phone in the case of the second decision tree;

ii. computing the estimated joint distribution of the predictors l^k and the phone p for 2K+1 predictors
using the training data, where the predictors are feature vector labels at times t-K, ..., t, ..., t+K and p
is the phone at time t;

iii. storing the estimated joint distribution of l^k and p and a corresponding distribution for each predictor
l^k at the root node;

iv. computing the best partitioning of the values that the predictor l^k can take, for each l^k, to minimize
phone uncertainty at each node;

v. choosing the predictor l^k whose partitioning results in the lowest uncertainty, and partitioning the
training data into two child nodes based on the computed partitioning of l^k, each child node being
assigned a class distribution based on the training data at the child node;

f) repeating for each child node if the amount of training data at the child node is greater than a threshold;

g) inputting an utterance to be recognized;

h) converting the utterance into an electrical signal;

i) representing the electrical signal as a series of quantized feature vectors;

j) matching the series of quantized feature vectors against the stored prototype feature vectors to determine
a closest match, and assigning an input label to each of the series of feature vectors corresponding to the
label of the closest matching prototype feature vector;

k) traversing one of the decision trees for every time frame of a speech input sequence to determine a
probability distribution for every time frame, the probability distribution being the probability that a node
is a phone boundary;

l) comparing the probabilities of the time frames with a threshold to identify some time frames as boundaries
between phones;

m) providing an acoustic score for all phones between every given pair of boundaries;

n) ranking the phones on the basis of this score;

o) outputting a recognition result in response to this score.

5. The method according to claim 4, further comprising the following steps:

traversing one or more decision trees of a second set of decision trees for every time frame in a speech
input sequence to determine a second probability distribution, the probability distribution being a distribution
over all ranks that the correct phone can take, in order to obtain a worst case rank of a correctly recognized
phone by choosing as the worst case rank the class value at which the cumulative probability distribution
of the classes exceeds a specified threshold;

assigning, as the absolute worst case rank among the worst case ranks between any two adjacent phone
boundaries, the worst case rank of the correctly recognized phone between the phone boundaries;

discarding all phone boundaries whose rank is worse than this absolute worst case rank in the current
segment;

making a short list for the segment;

outputting a recognition result in response to the short list.

6. An apparatus for speech recognition, comprising:

a) means for inputting a plurality of words of training data;

b) means for training one or more binary first decision trees to ask, at each node, a maximally informative
question based on context data within the training data, wherein each binary first decision tree may
correspond to a different time in a sequence of the training data;

c) means for traversing one of the decision trees for every time frame of a speech input sequence to
determine a probability distribution for every time frame, the probability distribution being the probability
that a node is a phone boundary;

d) means for comparing the probabilities of the time frames with a threshold to identify some time frames
as boundaries between phones;

e) means for providing an acoustic score for all phones between every given pair of boundaries;

f) means for ranking the phones on the basis of this score;

g) means for outputting a recognition result in response to this score.

7. The apparatus according to claim 6, further comprising:

h) means for traversing one or more decision trees of a second set of decision trees for every time frame
in a speech input sequence to determine a second probability distribution, the probability distribution being
a distribution over all ranks that the correct phone can take, in order to obtain a worst case rank of a
correctly recognized phone by choosing as the worst case rank the class value at which the cumulative
probability distribution of the classes exceeds a specified threshold;

i) means for assigning, as the absolute worst case rank among the worst case ranks between any two
adjacent phone boundaries, the worst case rank of the correctly recognized phone between the phone
boundaries;

j) means for discarding all phone boundaries whose rank is worse than this absolute worst case rank in
the current segment;

k) means for making a short list for the segment;

l) means for outputting a recognition result if the short list of the recognition result is a short list of words.

8. The apparatus according to claim 6, further comprising:

h) means for traversing one or more decision trees of a second set of decision trees for every time frame
in a speech input sequence to determine a second probability distribution, the probability distribution being
a distribution over all ranks that the correct phone can take, in order to obtain a worst case rank of a
correctly recognized phone by choosing as the worst case rank the class value at which the cumulative
probability distribution of the classes exceeds a specified threshold;

i) means for assigning, as the absolute worst case rank among the worst case ranks between any two
adjacent phone boundaries, the worst case rank of the correctly recognized phone between the phone
boundaries;

j) means for discarding all phone boundaries whose rank is worse than this absolute worst case rank in
the current segment;

k) means for making a short list of the phones for the segment;

l) means for comparing the constituent phones of a word in a vocabulary to determine whether the word
lies in the short list, and for making a short list of words;

m) means for outputting a recognition result by comparing the words of the short list with a language model
to determine the most probable word match for the speech input sequence.

9. An apparatus for speech recognition, comprising:

a) means for inputting a string of utterances constituting training data;

b) means for converting the utterances of the training data into electrical signals;

c) means for representing the electrical signal of the training data as prototype quantized feature vectors,
one feature vector representing a given time frame;

d) means for assigning to each prototype feature vector a class label associated with the prototype quantized
feature vector;

e) means for building one or more binary decision trees for different times in the training data, each tree
having a root node and a plurality of child nodes, comprising the following steps:

i. means for creating a set of training records comprising 2K+1 predictors, l^k, and one predicted class,
p, where the 2K+1 predictors are feature vector labels at 2K+1 consecutive times t-K, ..., t, ..., t+K, and
the predicted class is a binary record indicator of whether time t is associated with a phone boundary in
the case of the first decision tree or with the correct phone in the case of the second decision tree;

ii. means for computing the estimated joint distribution of the predictors l^k and the phone p for 2K+1
predictors using the training data, where the predictors are feature vector labels at times t-K, ..., t, ...,
t+K and p is the phone at time t;

iii. means for storing the estimated joint distribution of l^k and p and a corresponding distribution for
each predictor l^k at the root node;

iv. means for computing the best partitioning of the values that the predictor l^k can take, for each l^k,
to minimize phone uncertainty at each node;

v. means for choosing the predictor l^k whose partitioning results in the lowest uncertainty, and for
partitioning the training data into two child nodes based on the computed partitioning of l^k, each child
node being assigned a class distribution based on the training data at the child node;

f) means for repeating for each child node if the amount of training data at the child node is greater than
a threshold;

g) means for inputting an utterance to be recognized;

h) means for converting the utterance into an electrical signal;

i) means for representing the electrical signal as a series of quantized feature vectors;

j) means for matching the series of quantized feature vectors against the stored prototype feature vectors
to determine a closest match, and for assigning an input label to each of the series of feature vectors
corresponding to the label of the closest matching prototype feature vector;

k) means for traversing one of the decision trees for every time frame of a speech input sequence to
determine a probability distribution for every time frame, the probability distribution being the probability
that a node is a phone boundary;

l) means for comparing the probabilities of the time frames with a threshold to identify some time frames
as boundaries between phones;

m) means for providing an acoustic score for all phones between every given pair of boundaries;

n) means for ranking the phones on the basis of this score;

o) means for outputting a recognition result in response to this score.

10. The apparatus according to claim 9, further comprising:

means for traversing one or more decision trees of a second set of decision trees for every time frame in
a speech input sequence to determine a second probability distribution, the probability distribution being
a distribution over all ranks that the correct phone can take, in order to obtain a worst case rank of a
correctly recognized phone by choosing as the worst case rank the class value at which the cumulative
probability distribution of the classes exceeds a specified threshold;

means for assigning, as the absolute worst case rank among the worst case ranks between any two adjacent
phone boundaries, the worst case rank of the correctly recognized phone between the phone boundaries;

means for discarding all phone boundaries whose rank is worse than this absolute worst case rank in the
current segment;

means for making a short list for the segment;

means for outputting a recognition result in response to the short list.



Claims (French-language version)

1. A method of speech recognition, comprising the steps of:

a) inputting a plurality of words of training data;

b) training one or more binary first decision trees to ask the most informative question at each node,
based on contextual information in the training data, wherein each binary first decision tree may
correspond to a different time in a sequence of the training data;

c) traversing one of the decision trees for every time frame of an input sequence of speech to determine
a probability distribution for every time frame, the probability distribution being the probability that a
node is a phone boundary;

d) comparing the probabilities associated with the time frames with a threshold to identify some time
frames as boundaries between phones;

e) providing an acoustic score for all phones between every given pair of boundaries;

f) ranking the phones on the basis of this score;

g) outputting a recognition result in response to the score.

2. The method according to claim 1, further comprising the steps of:

h) traversing one or more of a second set of decision trees for every time frame of an input sequence of
speech to determine a second probability distribution, the probability distribution being a distribution
over all possible ranks that the correct phone can take, in order to obtain a worst case rank of a correctly
recognized phone by choosing as the worst case rank the class value at which the cumulative probability
distribution of the classes exceeds a specified threshold;

i) assigning, as the absolute worst case rank among the worst case ranks between any two adjacent phone
boundaries, the worst case rank of the correctly recognized phone between the phone boundaries;

j) discarding all phones whose rank is worse than this absolute worst case rank in the current segment;

k) making a short list of phones for the segment;

l) outputting a recognition result if the short list of the recognition result is a short list of words.

3. The method according to claim 1, further comprising the steps of:

h) traversing one or more trees of a second set of decision trees for every time frame of an input sequence
of speech to determine a second probability distribution, the probability distribution being a distribution
over all possible ranks that a phone can take, in order to obtain a worst case rank of a correctly recognized
phone by choosing as the worst case rank the class value at which the cumulative probability distribution
of the classes exceeds a specified threshold;

i) assigning, as the absolute worst case rank among all the worst case ranks between any two adjacent
phone boundaries, the worst case rank of the correctly recognized phone between the phone boundaries;

j) discarding all phones whose rank is worse than this absolute worst case rank in the current segment;

k) making a short list of phones for the segment;

l) comparing the constituent phones of a word with a vocabulary to see whether the word lies in the short
list, and making a short list of words;

m) outputting a recognition result by comparing the words of the short list with a language model to
determine the most probable word match for the input sequence of speech.

4. A method of speech recognition, comprising the steps of:

a) inputting a string of utterances constituting training data;

b) converting the utterances of the training data into electrical signals;

c) representing the electrical signal of the training data as prototype vectors, one vector representing a
given time frame;

d) assigning to each prototype feature vector a class label associated with the prototype quantized feature
vector;

e) forming one or more binary decision trees for different times in the training data, each tree having a
root node and a plurality of child nodes, comprising the following steps:

i. creating a set of training records comprising 2K+1 predictors, l^k, and one predicted class, p, where the
2K+1 predictors are feature vector labels at 2K+1 consecutive times t-K, ..., t, ..., t+K, and the predicted
class is a binary record indicator of whether time t is associated with a phone boundary in the case of the
first decision tree or with the correct phone in the case of the second decision tree;

ii. computing the estimated joint distribution of the predictors l^k and the phone p for the 2K+1 predictors
using the training data, where the predictors are the feature vector labels at times t-K, ..., t, ..., t+K and
p is the phone at time t;

iii. storing the estimated joint distribution of l^k and p and a corresponding distribution for each predictor
l^k at the root node;

iv. computing the best partitioning of the values that the predictor l^k can take, for each l^k, to minimize
phone uncertainty at each node;

v. choosing the predictor l^k whose partitioning gives the lowest uncertainty, and partitioning the training
data into two child nodes based on the computed partitioning of l^k, each child node being assigned a
class distribution based on the training data at the child node;

f) repeating for each child node if the amount of training data at the child node is greater than a threshold;

g) inputting an utterance to be recognized;

h) converting the utterance into an electrical signal;

i) representing the electrical signal as a series of quantized feature vectors;

j) matching the series of quantized feature vectors against the stored prototype feature vectors to determine
the closest match, and assigning an input label to each of the series of feature vectors corresponding to
the label of the closest matching prototype feature vector;

k) traversing one of the decision trees for every time frame of an input sequence of speech to determine
a probability distribution for every time frame, the probability distribution being the probability that a
node is a phone boundary;

l) comparing the probabilities associated with the time frames with a threshold to identify certain time
frames as boundaries between phones;

m) providing an acoustic score for all phones between every given pair of boundaries;

n) ranking the phones on the basis of this score;

o) outputting a recognition result in response to said score.

5. The method according to claim 4, further comprising the steps of:

traversing one or more of a second set of decision trees for every time frame of an input sequence of
speech to determine a second probability distribution, the probability distribution being a distribution
over all possible ranks that a phone can take, in order to obtain a worst case rank of a correctly recognized
phone by choosing as the worst case rank the class value at which the cumulative probability distribution
of the classes exceeds a specified threshold;

assigning, as the absolute worst case rank among the worst case ranks between two adjacent phone
boundaries, the worst case rank of the correctly recognized phone between the phone boundaries;

discarding all phone boundaries whose rank is worse than this absolute worst case rank in the current
segment;

making a short list for the segment;

outputting a recognition result in response to the short list.

Appareii pour la reconnaissance de la parole, comportant: 

a) un moyen pour entrer une plurality de mots des donn^es de formation ; 

b) un moyen pour former un ou plusleurs premiers arbres binalres de decision pour poser la question la plus 
instructive k chaque noeud, en se basant sur I'information contextueile dans les donn^es de fomiation, oCi 
chaque premier arbre binaire de decision peut correspondre k un temps different dans une sequence des 
donn§es de formation; 

c) un moyen pour traverser un des arbres de decision pour chaque tranche de temps d'une sequence d'entr^e 
du discours pour determiner une distribution de probability pour chaque tranche de temps, la distribution de 
probability ytant la probability qu'un noeud soit une limite de son ; 

d) un moyen pour comparer les probabilitys associyes aux tranches de temps k un seull pour identifier quel- 
ques tranches de temps comme limites de sons ; 

e) un moyen pour foumir un rysultat acoustique pour toutes les limites de temps entre chaque paire donnye 

de limites ; 

f) un moyen pour classer les sons sur la base de ce rysultat; 

g) un moyen pour sortir un rysultat de reconnaissance en ryponse au rysultat. 
Appareii selon la revendication 6 comprenant en outre : 

h) un moyen pour parcourir un ou plusleurs d'un deuxiyme ensemble d'arbres de dycision pour chaque tranche 
de tenfips sur une syquence d'entrye du discours pour dytemniner une deuxi^me distribution de probability, 
la probability de distribution ytant une distribution au-dessus de tous les rangs possibles que le son correct 
peut prendre pour obtenir un plus classement parmi les pires d'un son correctement identifiy en choisissant 
le pire classement comme valeur type k laquelle la distribution de probability cumulative des classes dypasse 
un seull prycis ; 

I) un moyen pour attribuer comme le pire classement possible panmi les classements les pires entre deux 
limites de son adjacentes quelconques. le pire classement du son correctement identifiy entre les limites du 
son; 

j) un moyen pour nygliger tous les sons dont le rang est plus mauvais que ce pire classement absolu dans te 
segment courant ; 

k) un moyen pour faire une courte liste des sons pour le segment; 

I) un moyen pour sortir un rysuifat d'identification si la liste courte du rysultat d'identification ytant une liste 
courte de mots. 
8. Apparatus according to claim 6, further comprising:

h) means for traversing one or more trees of a second set of decision trees for every time frame of an
input sequence of speech to determine a second probability distribution, the probability distribution being
a distribution over all possible ranks that a phone can take, so as to obtain a worst-case rank of a
correctly identified phone by choosing the worst-case rank as the class value at which the cumulative
probability distribution of the classes exceeds a specified threshold;

i) means for assigning, as the absolute worst-case rank among all the worst-case ranks between any two
adjacent phone boundaries, the worst-case rank of the correctly recognized phone between the phone
boundaries;

j) means for discarding all phones whose rank is worse than this absolute worst-case rank in the current
segment;

k) means for making a short list of phones for the segment;

l) means for comparing the phones constituting a word with a vocabulary to see whether the word is in the
short list, and establishing a short list of words;

m) means for outputting a recognition result by comparing the words of the short list with a language
model to determine the most probable word match for the input sequence of speech.
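
The last two elements, building a short list of words and comparing it with a language model, might be sketched as follows. This is only an illustration under strong assumptions: each word is taken to occupy consecutive segments, one phone per segment, and the language model is reduced to a table of unigram log-probabilities. The names `shortlist_words` and `best_word`, the phone symbols, and all values are hypothetical.

```python
def shortlist_words(vocabulary, segment_shortlists, start=0):
    """Keep the words whose every phone appears in the short list of its segment."""
    kept = []
    for word, phones in vocabulary.items():
        segments = segment_shortlists[start:start + len(phones)]
        # A word needs one segment per phone, and each phone must have survived pruning.
        if len(segments) == len(phones) and all(
            phone in allowed for phone, allowed in zip(phones, segments)
        ):
            kept.append(word)
    return kept


def best_word(words, log_probs):
    """Pick the shortlisted word with the highest language-model log-probability."""
    return max(words, key=lambda w: log_probs.get(w, float("-inf")), default=None)


# Hypothetical vocabulary, per-segment phone short lists, and unigram scores.
vocabulary = {"cat": ["K", "AE", "T"], "cap": ["K", "AE", "P"], "at": ["AE", "T"]}
segment_shortlists = [{"K", "G"}, {"AE", "EH"}, {"T", "D"}]
unigram_log_probs = {"cat": -2.0, "cap": -3.5, "at": -1.0}

words = shortlist_words(vocabulary, segment_shortlists)
print(words, best_word(words, unigram_log_probs))  # ['cat'] cat
```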

9. A speech recognition apparatus, comprising:

a) means for inputting a series of utterances constituting training data;

b) means for converting the utterances of the training data into electrical signals;

c) means for representing the electrical signal of the training data by prototype vectors, one vector
representing a given time frame;

d) means for assigning to each prototype feature vector a class label associated with the quantized
prototype feature vector;

e) means for forming one or more binary decision trees for different times in the training data, each tree
having a root node and a plurality of child nodes, comprising the following:

i. means for creating a set of training records comprising 2K+1 predictors, l^k, and one predicted class,
p, where the 2K+1 predictors are feature vector labels at the 2K+1 consecutive times t-K, ..., t, ..., t+K,
and the predicted class is a binary record indicating whether time t is associated with a phone boundary in
the case of the first decision tree, or is associated with the correct phone in the case of the second
decision tree;

ii. means for computing the estimated joint distribution of the predictors l^k and the phone p for the 2K+1
predictors using the training data, where the predictors are the feature vector labels at times
t-K, ..., t, ..., t+K and p is the phone at time t;

iii. means for storing the estimated joint distribution of l^k and p, and a corresponding distribution for
each predictor l^k, at the root node;

iv. means for computing the best partitioning of the values that the predictor l^k can take, for each l^k,
to minimize the uncertainty about the phone at each node;

v. means for choosing the predictor l^k whose partitioning gives the lowest uncertainty and partitioning the
training data into two child nodes based on the computed partitioning of l^k, each child node being assigned
a class distribution according to the training data at the child node;

f) means for repeating for each child node if the amount of training data at the child node is greater
than a threshold;

g) means for inputting an utterance to be recognized;

h) means for converting the utterance into an electrical signal;

i) means for representing the electrical signal by a series of quantized feature vectors;

j) means for comparing the series of quantized feature vectors with the stored prototype feature vectors
to determine the closest match, and assigning to each feature vector of the series an input label
corresponding to the label of the best-matching prototype feature vector;

k) means for traversing one of the decision trees for every time frame of an input sequence of speech to
determine a probability distribution for every time frame, the probability distribution being the
probability that a node is a phone boundary;

l) means for comparing the probabilities associated with the time frames with a threshold, to identify
certain time frames as boundaries between phones;

m) means for providing an acoustic score for all phones between every given pair of boundaries;

n) means for ranking the phones on the basis of this score;

o) means for outputting a recognition result in response to said score.
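
The tree-growing elements of this claim (choosing, at a node, the predictor and partition that leave the least uncertainty about the class) can be illustrated with a short sketch. It measures uncertainty as Shannon entropy and, as a deliberate simplification, only tries partitions of the form "one label value versus all others"; the claim does not say how the best partition is searched, so this restriction is an assumption. The function names and the training records are hypothetical.

```python
import math
from collections import Counter


def entropy(classes):
    """Shannon entropy, in bits, of a non-empty list of class values."""
    total = len(classes)
    return sum((n / total) * math.log2(total / n) for n in Counter(classes).values())


def split_entropy(records, k, value):
    """Average class entropy after splitting on 'predictor k equals value'."""
    left = [c for labels, c in records if labels[k] == value]
    right = [c for labels, c in records if labels[k] != value]
    if not left or not right:
        return None  # not a real partition of the data
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(records)


def best_split(records):
    """Return (entropy, predictor index, label value) for the lowest-entropy split."""
    best = None
    for k in range(len(records[0][0])):
        for value in sorted({labels[k] for labels, _ in records}):
            h = split_entropy(records, k, value)
            if h is not None and (best is None or h < best[0]):
                best = (h, k, value)
    return best


# Hypothetical records with K = 1: labels at times t-1, t, t+1 and a binary
# class that is 1 when time t is a phone boundary.
records = [
    ((3, 3, 7), 1),
    ((3, 3, 7), 1),
    ((3, 3, 3), 0),
    ((7, 7, 3), 0),
    ((7, 3, 3), 0),
]
print(best_split(records))  # (0.0, 2, 3): the label at t+1 separates the classes
```

Applying `best_split` recursively to each child, and stopping once a child holds less training data than a threshold, corresponds to the repetition recited in element f).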

10. Apparatus according to claim 9, further comprising:

means for traversing one or more trees of a second set of decision trees for every time frame of an input
sequence of speech to determine a second probability distribution, the probability distribution being a
distribution over all possible ranks that a phone can take, so as to obtain a worst-case rank of a correctly
identified phone by choosing the worst-case rank as the class value at which the cumulative probability
distribution of the classes exceeds a specified threshold;

means for assigning, as the absolute worst-case rank among the worst-case ranks between two adjacent phone
boundaries, the worst-case rank of the correctly identified phone between the phone boundaries;

means for discarding all phones whose rank is worse than this absolute worst-case rank in the current
segment;

means for making a short list for the segment;

means for outputting a recognition result in response to the short list.
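
How the boundary detection of claim 9 and the rank-based pruning of claim 10 could fit together is sketched below. The sketch assumes that per-frame boundary probabilities and per-frame worst-case ranks are already available from the two sets of decision trees, and that the candidate phones of each segment are already ordered by acoustic score. The function names and all numbers are hypothetical.

```python
def find_boundaries(boundary_probs, threshold):
    """Return the indices of time frames whose boundary probability exceeds threshold."""
    return [t for t, p in enumerate(boundary_probs) if p > threshold]


def prune_segments(boundaries, frame_worst_ranks, ranked_phones):
    """Keep, per segment, the phones ranked no worse than the absolute worst-case rank.

    ranked_phones[i] lists the candidate phones of segment i, best first.
    """
    shortlists = []
    for i, (start, end) in enumerate(zip(boundaries, boundaries[1:])):
        # The absolute worst-case rank is the largest per-frame worst-case rank
        # found between the two adjacent boundaries.
        absolute_worst = max(frame_worst_ranks[start:end])
        shortlists.append(ranked_phones[i][:absolute_worst])
    return shortlists


# Hypothetical per-frame tree outputs and ranked candidates for two segments.
boundary_probs = [0.9, 0.1, 0.2, 0.8, 0.1, 0.95]
frame_worst_ranks = [2, 1, 3, 1, 2, 1]
ranked_phones = [["AE", "EH", "IH", "AA"], ["T", "D", "K", "P"]]

boundaries = find_boundaries(boundary_probs, 0.5)
print(boundaries)  # [0, 3, 5]
print(prune_segments(boundaries, frame_worst_ranks, ranked_phones))
# [['AE', 'EH', 'IH'], ['T', 'D']]
```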